MODEL SELECTION IN LOGISTIC REGRESSION

Abstract. This paper is devoted to model selection in logistic regression. We extend the model
selection principle introduced by Birgé and Massart (2001) to the logistic regression model. This
selection is done by using penalized maximum likelihood criteria. We propose in this context a
completely data-driven criterion based on the slope heuristics. We prove non-asymptotic oracle
inequalities for the selected estimators. Theoretical results are illustrated through simulation studies.

1. Introduction

Consider the following generalization of the logistic regression model: let $(Y_1, x_1), \dots, (Y_n, x_n)$
be a sample of size $n$ such that $(Y_i, x_i) \in \{0,1\} \times \mathcal{X}$ and

$$\mathbb{E}_{f_0}(Y_i) = \pi_{f_0}(x_i) = \frac{\exp f_0(x_i)}{1 + \exp f_0(x_i)},$$

where $f_0$ is an unknown function to be estimated and the design points $x_1, \dots, x_n$ are deterministic.
This model can be viewed as a nonparametric version of the "classical" logistic model, which
relies on the assumption that $x_i \in \mathbb{R}^d$ and that there exists $\beta_0 \in \mathbb{R}^d$ such that $f_0(x_i) = \beta_0^\top x_i$.

Logistic regression is a widely used model for predicting the outcome of a binary dependent
variable. For example, the logistic model can be used in a medical study to predict the probability
that a patient has a given disease (e.g. cancer), using observed characteristics (explanatory
variables) of the patient such as weight, age, gender, etc. However, in the presence
of numerous explanatory variables with potential influence, one would like to use only a small
number of variables, for the sake of interpretability or to avoid overfitting. But it is not always
obvious how to choose the adequate variables. This is the well-known problem of variable selection
or model selection.

In this paper, the unknown function $f_0$ is not specified and not necessarily linear. Our aim is
to estimate $f_0$ by a linear combination of given functions, often called a dictionary. The dictionary
can be a basis of functions, for instance a spline or polynomial basis.

A nonparametric version of the classical logistic model has already been considered by Hastie
(1983), where a nonparametric estimator of $f_0$ is proposed using local maximum likelihood. The
problem of nonparametric estimation in the additive regression model is well known and deeply
studied, but it is less studied in the logistic regression model. One can cite for instance Lu (2006),
Vexler (2006), Fan et al. (1998), Farmen (1996), Raghavan (1993), and Cox (1990).

Recently, a few papers have dealt with model selection or nonparametric estimation in logistic
regression using penalized contrasts: Bunea (2008), Bach (2010), van de Geer (2008), Kwemou
(2012). Among them, some establish non-asymptotic oracle inequalities that hold even in a high
dimensional setting. When the dimension of $X$ is high, that is, greater than a dozen, such $\ell_1$-penalized
contrast estimators are known to provide reasonably good results. When the dimension
of $X$ is small, it is often better to choose different penalty functions. One classical penalty
function is what we call $\ell_0$ penalization. Such penalty functions, built as increasing functions of the
dimension of $X$, usually refer to model selection. The last decades have witnessed a growing
interest in the model selection problem since the seminal works of Akaike (1973) and Schwarz
(1978). In additive regression one can cite, among others, Baraud (2000a), Birgé and Massart
(2001), Yang (1999); in density estimation, Birgé (2014), Castellan (2003a); and in segmentation
problems, Lebarbier (2005), Durot et al. (2009), and Braun et al. (2000). All the previously cited
papers use an $\ell_0$-penalized contrast to perform model selection. But model selection procedures
based on penalized maximum likelihood estimators in logistic regression are less studied in the
literature.

In this paper we focus on model selection using penalized contrasts for the logistic regression
model, and in this context we state non-asymptotic oracle inequalities. More precisely,
given some collection of functions, we consider estimators of $f_0$ built as linear combinations of these
functions. The point is that the true function is not supposed to be a linear combination of those
functions, but we expect that the spaces of linear combinations of those functions provide
suitable approximation spaces. Thus, to this collection of functions, we associate a collection
of estimators of $f_0$. Our aim is to propose a data-driven procedure, based on a penalized criterion,
which is able to choose the "best" estimator among the collection of estimators, using $\ell_0$
penalty functions.

The collection of estimators is built by minimizing the opposite of the log-likelihood.
The properties of the estimators are described in terms of the Kullback-Leibler divergence and
the empirical $L_2$ norm. Our results can be split into two parts.

First, in a general model selection framework, with a general collection of functions, we
provide a completely data-driven procedure that automatically selects the best model among the
collection. We state non-asymptotic oracle inequalities for the Kullback-Leibler divergence and
the empirical $L_2$ norm between the selected estimator and the true function $f_0$. The estimation
procedure relies on the construction of a suitable penalty function, suitable in the sense that it
achieves the best risks and in the sense that it does not depend on the unknown smoothness
parameters of the true function $f_0$. However, the penalty function depends on a bound related to the
target function $f_0$. This can be seen as the price to pay for the generality. It comes from the links
needed between the Kullback-Leibler divergence and the empirical $L_2$ norm.

Second, we consider the specific case of a collection of piecewise constant functions, which provide
estimators of regressogram type. In this case, we exhibit a completely data-driven penalty, free
from $f_0$. The model selection procedure based on this penalty provides an adaptive estimator,
and we state a non-asymptotic oracle inequality for the Hellinger distance and the empirical $L_2$ norm
between the selected estimator and the true function $f_0$. In the case of a piecewise constant
function basis, the connection between the Kullback-Leibler divergence and the empirical $L_2$ norm is
obtained without a bound on the true function $f_0$. This last result is of great interest, for example, in
segmentation studies, where the target function is piecewise constant or can be well approximated
by piecewise constant functions.

Those theoretical results are illustrated through simulation studies. In particular we show that
our model selection procedure (with the suitable penalty) has good non-asymptotic properties
compared to the usual criteria such as AIC and BIC. Particular attention has been paid to
the practical calibration of the penalty function. This practical calibration is mainly based on
the ideas of what is usually referred to as the slope heuristics, as proposed in Birgé and Massart (2007)
and developed in Arlot and Massart (2009).

The paper is organized as follows. In Section 2 we set our framework and describe our
estimation procedure. In Section 3 we define the model selection procedure and state the oracle
inequalities in the general framework. Section 4 is devoted to regressogram selection; in this
section, we establish a bound on the Hellinger risk between the selected model and the target
function. The simulation study is reported in Section 5. The proofs of the results are postponed
to Sections 6 and 7.


2. Model and framework 


Let $(Y_1, x_1), \dots, (Y_n, x_n)$ be a sample of size $n$ such that $(Y_i, x_i) \in \{0,1\} \times \mathcal{X}$. Throughout the
paper, we consider a fixed design setting, i.e. $x_1, \dots, x_n$ are considered as deterministic. In this
setting, consider the extension of the "classical" logistic regression model (2.1), where we aim
at estimating the unknown function $f_0$ in

$$\mathbb{E}_{f_0}(Y_i) = \pi_{f_0}(x_i) = \frac{\exp f_0(x_i)}{1 + \exp f_0(x_i)}. \tag{2.1}$$

We propose to estimate the unknown function $f_0$ by model selection. This model selection
is performed using penalized maximum likelihood estimators. In the following we denote by
$P_{f_0}(x_i)$ the distribution of $Y_i$ and by $P^{(n)}_{f_0}(x_1, \dots, x_n)$ the distribution of $(Y_1, \dots, Y_n)$ under Model
(2.1). Since the variables $Y_i$ are independent random variables,

$$P^{(n)}_{f_0}(x_1, \dots, x_n) = \prod_{i=1}^{n} P_{f_0}(x_i) = \prod_{i=1}^{n} \pi_{f_0}(x_i)^{Y_i} \left(1 - \pi_{f_0}(x_i)\right)^{1 - Y_i}.$$


It follows that for a function $f$ mapping $\mathcal{X}$ into $\mathbb{R}$, the likelihood is defined as

$$L_n(f) = P^{(n)}_{f}(x_1, \dots, x_n) = \prod_{i=1}^{n} \pi_f(x_i)^{Y_i} \left(1 - \pi_f(x_i)\right)^{1 - Y_i},$$

where

$$\pi_f(x_i) = \frac{\exp(f(x_i))}{1 + \exp(f(x_i))}. \tag{2.2}$$
We choose the opposite of the log-likelihood as the estimation criterion, that is,

$$\gamma_n(f) = -\frac{1}{n} \log(L_n(f)) = \frac{1}{n} \sum_{i=1}^{n} \left[ \log\left(1 + e^{f(x_i)}\right) - Y_i f(x_i) \right]. \tag{2.3}$$
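As an illustration, criterion (2.3) can be evaluated numerically as sketched below. This is only a sketch: the function name `gamma_n` and its arguments are our own choices, not notation from the paper. It takes the values $f(x_i)$ and the responses $Y_i$ and returns the normalized negative log-likelihood.

```python
import numpy as np

def gamma_n(f_values, y):
    """Empirical contrast (2.3): mean of log(1 + exp(f(x_i))) - y_i * f(x_i)."""
    f_values = np.asarray(f_values, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.logaddexp(0, f) evaluates log(1 + exp(f)) without overflow for large f.
    return float(np.mean(np.logaddexp(0.0, f_values) - y * f_values))
```

Writing the contrast in terms of $f$ rather than $\pi_f$ avoids taking logarithms of probabilities that are numerically 0 or 1.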

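Associated to this estimation criterion we consider the Kullback-Leibler information divergence,
defined as

$$\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f}\right) = \frac{1}{n} \int \log\left(\frac{dP^{(n)}_{f_0}}{dP^{(n)}_{f}}\right) dP^{(n)}_{f_0}.$$

The loss function is the excess risk, defined as

$$\mathcal{E}(f) := \gamma(f) - \gamma(f_0), \quad \text{where, for any } f, \ \gamma(f) = \mathbb{E}_{f_0}[\gamma_n(f)]. \tag{2.4}$$

Easy calculations show that the excess risk is linked to the Kullback-Leibler information
divergence through the relation

$$\mathcal{E}(f) = \gamma(f) - \gamma(f_0) = \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f}\right).$$

It follows that $f_0$ minimizes the excess risk, that is,

$$f_0 = \arg\min_{f} \gamma(f).$$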

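As usual, one cannot estimate $f_0$ by the minimizer of $\gamma_n(f)$ over an arbitrary function space, since
such a space is infinite dimensional. The usual way is to minimize $\gamma_n(f)$ over a finite dimensional
collection of models, associated to a finite dictionary of functions $\phi_j : \mathcal{X} \to \mathbb{R}$,

$$\mathcal{D} = \{\phi_1, \dots, \phi_M\}.$$

For the sake of simplicity we will suppose that $\mathcal{D}$ is an orthonormal basis of functions. Indeed,
if $\mathcal{D}$ is not an orthonormal basis of functions, we can always find an orthonormal basis of
functions $\mathcal{D}' = \{\phi'_1, \dots, \phi'_{M'}\}$ such that

$$\langle \phi_1, \dots, \phi_M \rangle = \langle \phi'_1, \dots, \phi'_{M'} \rangle,$$

where $\langle \cdot \rangle$ denotes the linear span.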

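Let $\mathcal{M}$ be the set of all subsets $m \subset \{1, \dots, M\}$. For every $m \in \mathcal{M}$, we call $S_m$ the model

$$S_m := \Big\{ f_\beta = \sum_{j \in m} \beta_j \phi_j \Big\} \tag{2.5}$$

and $D_m$ the dimension of the span of $\{\phi_j, j \in m\}$. Given the countable collection of models
$\{S_m\}_{m \in \mathcal{M}}$, we define $\{\hat f_m\}_{m \in \mathcal{M}}$ the corresponding estimators, i.e. the estimators obtained by
minimizing $\gamma_n$ over each model $S_m$. For each $m \in \mathcal{M}$, $\hat f_m$ is defined by

$$\hat f_m = \arg\min_{t \in S_m} \gamma_n(t). \tag{2.6}$$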

Our aim is to choose the "best" estimator among this collection of estimators, in the sense that
it minimizes the risk. In many cases, it is not easy to choose the "best" model. Indeed, a
model with small dimension tends to be efficient from the estimation point of view, whereas it could
be far from the "true" model. On the other hand, a more complex model easily fits the data but the
estimates have poor predictive performance (overfitting). We thus expect that this best estimator
mimics what is usually called the oracle, defined as

$$m^* = \arg\min_{m \in \mathcal{M}} \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_m}\right). \tag{2.7}$$

Unfortunately, both minimizing the risk and minimizing the Kullback-Leibler divergence
require the knowledge of the true (unknown) function $f_0$ to be estimated.

Our goal is to develop a data-driven strategy that automatically selects the
best estimator among the collection, this best estimator having a risk as close as possible to
the oracle risk, that is, the risk of $\hat f_{m^*}$. In this context, our strategy follows the lines of model
selection as developed by Birgé and Massart (2001). We also refer to the book by Massart (2007)
for further details on model selection.

We use a penalized maximum likelihood estimator for choosing some data-dependent $\hat m$ nearly
as good as the ideal choice $m^*$. More precisely, the idea is to select $\hat m$ as a minimizer of the
penalized criterion

$$\hat m = \arg\min_{m \in \mathcal{M}} \left\{ \gamma_n(\hat f_m) + \mathrm{pen}(m) \right\}, \tag{2.8}$$


where $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ is a data-driven penalty function. The estimation properties of $\hat f_{\hat m}$ are
evaluated by non-asymptotic bounds of a risk associated to a suitably chosen loss function. The
great challenge is choosing the penalty function such that the selected model $\hat m$ is nearly as good
as the oracle $m^*$. This penalty term is classically based on the idea that

$$m^* = \arg\min_{m \in \mathcal{M}} \mathbb{E}_{f_0} \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_m}\right) = \arg\min_{m \in \mathcal{M}} \left[ \mathbb{E}_{f_0} \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f_m}\right) + \mathbb{E}_{f_0} \mathcal{K}\left(P^{(n)}_{f_m}, P^{(n)}_{\hat f_m}\right) \right],$$

where $f_m$ is defined as

$$f_m = \arg\min_{t \in S_m} \gamma(t).$$

Our goal is to build a penalty function such that the selected model $\hat m$ fulfills an oracle
inequality:

$$\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_{\hat m}}\right) \le C_n \inf_{m \in \mathcal{M}} \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_m}\right) + R_n.$$

This inequality is expected to hold either in expectation or with high probability, where $C_n$ is as
close to 1 as possible and $R_n$ is a remainder term negligible compared to $\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_{m^*}}\right)$.

In the following we consider two separate cases. First we consider a general collection of
models under a boundedness assumption. Second we consider the specific case of a regressogram
collection.


3. Oracle inequality for general models collection under boundedness assumption 

Consider model (2.1) and $(S_m)_{m \in \mathcal{M}}$ a collection of models defined by (2.5). Let $C_0 > 0$ and
$L_\infty(C_0) = \{ f : \mathcal{X} \to \mathbb{R}, \ \max_{1 \le i \le n} |f(x_i)| \le C_0 \}$. For $m \in \mathcal{M}$, with $\gamma_n$ given in (2.3) and $\gamma$ given in
(2.4), we define

$$\hat f_m = \arg\min_{t \in S_m \cap L_\infty(C_0)} \gamma_n(t) \quad \text{and} \quad f_m = \arg\min_{t \in S_m \cap L_\infty(C_0)} \gamma(t). \tag{3.9}$$


The first step consists in studying the estimation properties of $\hat f_m$ for each $m$, as stated in
the following proposition.

Proposition 3.1. Let $C_0 > 0$ and $\mathcal{U}_0 = e^{C_0} / (1 + e^{C_0})^2$. For $m \in \mathcal{M}$, let $\hat f_m$ and $f_m$ be as in (3.9). We
have

$$\mathbb{E}_{f_0}\left[\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_m}\right)\right] \le \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f_m}\right) + \frac{D_m}{2 n\, \mathcal{U}_0^2}.$$

This proposition says that the "best" estimator among the collection $\{\hat f_m\}_{m \in \mathcal{M}}$, in the sense of
the Kullback-Leibler risk, is the one which makes a balance between the bias and the complexity
of the model. In the ideal situation where $f_0$ belongs to $S_m$, we have that

$$\mathbb{E}_{f_0}\left[\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_m}\right)\right] \le \frac{D_m}{2 n\, \mathcal{U}_0^2}.$$

To derive the model selection procedure we need the following assumption:

(A1) There exists a constant $0 < c_1 < \infty$ such that $\max_{1 \le i \le n} |f_0(x_i)| \le c_1$.

In the following theorem we propose a choice for the penalty function and we state non-asymptotic
risk bounds.

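Theorem 3.1. Given $C_0 > 0$, for $m \in \mathcal{M}$, let $\hat f_m$ and $f_m$ be defined as in (3.9). Let us denote
$\| f \|_n^2 = n^{-1} \sum_{i=1}^{n} f^2(x_i)$. Let $\{L_m\}_{m \in \mathcal{M}}$ be some positive numbers satisfying

$$\Sigma = \sum_{m \in \mathcal{M}} \exp(-L_m D_m) < \infty.$$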

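We define $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ such that, for $m \in \mathcal{M}$,

$$\mathrm{pen}(m) \ge \lambda \frac{D_m}{n} \left( \frac{1}{2} + \sqrt{5 L_m} \right),$$

where $\lambda$ is a positive constant depending on $c_1$. Under Assumption (A1) we have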
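$$\mathbb{E}_{f_0}\left[\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_{\hat m}}\right)\right] \le C \inf_{m \in \mathcal{M}} \left\{ \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f_m}\right) + \mathrm{pen}(m) \right\} + C_1 \frac{\Sigma}{n}$$

and

$$\mathbb{E}_{f_0} \| \hat f_{\hat m} - f_0 \|_n^2 \le C' \inf_{m \in \mathcal{M}} \left\{ \| f_0 - f_m \|_n^2 + \mathrm{pen}(m) \right\} + C'_1 \frac{\Sigma}{n},$$

where $C$, $C'$, $C_1$, $C'_1$ are constants depending on $c_1$ and $C_0$.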
This theorem provides oracle inequalities for the $L_2$-norm and for the K-L divergence between the
selected model and the true function: provided that the penalty has been properly chosen, one can
bound both quantities.

The inequalities in Theorem 3.1 are non-asymptotic in the sense that the result is
obtained for a fixed $n$. This theorem is very general and does not make specific assumptions on the
dictionary. However, the penalty function depends on an unknown constant (the multiplicative
constant in the penalty of Theorem 3.1), which depends
on the bound of the true function $f_0$ through Condition (6.5). In practice this constant can be
calibrated using the "slope heuristics" proposed in Birgé and Massart (2007). In the following we
will show how to obtain a similar result with a penalty function not connected to the bound of the
true unknown function $f_0$ in the regressogram case.

4. Regressogram functions 

4.1. Collection of models. In this section we suppose (without loss of generality) that $f_0 :
[0,1] \to \mathbb{R}$. For the sake of simplicity, we use the notation $f_0(x_i) = f_0(i)$ for every $i = 1, \dots, n$.
Hence $f_0$ is defined from $\{1, \dots, n\}$ to $\mathbb{R}$. Let $\mathcal{M}$ be a collection of partitions of
$\mathcal{X} = \{1, \dots, n\}$ into intervals. For any $m \in \mathcal{M}$ and $J \in m$, let $\mathbf{1}_J$ denote the indicator function of $J$ and
let $S_m$ be the linear span of $\{\mathbf{1}_J, J \in m\}$. When all intervals have the same length, the partition is said
to be regular; it is irregular otherwise.

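4.2. Collection of estimators: regressogram. For a fixed $m$, the minimizer $\hat f_m$ of the empirical
contrast function $\gamma_n$ over $S_m$ is called the regressogram. That is, $f_0$ is estimated by $\hat f_m$ given by

$$\hat f_m = \arg\min_{f \in S_m} \gamma_n(f), \tag{4.10}$$

where $\gamma_n$ is given by (2.3). Associated to $S_m$ we have

$$f_m = \arg\min_{f \in S_m} \left[ \gamma(f) - \gamma(f_0) \right] = \arg\min_{f \in S_m} \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f}\right). \tag{4.11}$$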


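In the specific case where $S_m$ is the set of piecewise constant functions on some partition $m$, $\hat f_m$
and $f_m$ are given by the following lemma.

Lemma 4.1. For $m \in \mathcal{M}$, let $f_m$ and $\hat f_m$ be defined by (4.11) and (4.10) respectively. Then
$f_m = \sum_{J \in m} f_m^{(J)} \mathbf{1}_J$ and $\hat f_m = \sum_{J \in m} \hat f_m^{(J)} \mathbf{1}_J$, with

$$f_m^{(J)} = \log\left( \frac{\sum_{i \in J} \pi_{f_0}(x_i)}{|J| \left(1 - \sum_{i \in J} \pi_{f_0}(x_i) / |J| \right)} \right) \quad \text{and} \quad \hat f_m^{(J)} = \log\left( \frac{\sum_{i \in J} Y_i}{|J| \left(1 - \sum_{i \in J} Y_i / |J| \right)} \right).$$

Moreover, $\pi_{f_m} = \sum_{J \in m} \pi_{f_m}^{(J)} \mathbf{1}_J$ and $\pi_{\hat f_m} = \sum_{J \in m} \pi_{\hat f_m}^{(J)} \mathbf{1}_J$, with

$$\pi_{f_m}^{(J)} = \frac{1}{|J|} \sum_{i \in J} \pi_{f_0}(x_i) \quad \text{and} \quad \pi_{\hat f_m}^{(J)} = \frac{1}{|J|} \sum_{i \in J} Y_i.$$

Consequently, $\pi_{f_m} = \arg\min_{\pi \in S_m} \| \pi - \pi_{f_0} \|_n$ is the usual projection of $\pi_{f_0}$ onto $S_m$.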
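The closed form of Lemma 4.1 for the estimator can be sketched as follows: on each cell $J$, the fitted probability is the empirical frequency of ones, and the fitted value of $f$ is its logit. The function name `regressogram_fit` and the encoding of the partition by integer cell labels `0, ..., K-1` (every label assumed to occur at least once) are our own assumptions for illustration.

```python
import numpy as np

def regressogram_fit(y, labels):
    """Return (pi_hat, f_hat) evaluated at each observation.

    labels[i] is the index of the cell J of the partition containing
    observation i; labels are assumed to cover 0, ..., K-1 with no empty cell.
    """
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    counts = np.bincount(labels)             # |J| for each cell
    sums = np.bincount(labels, weights=y)    # sum of Y_i over each cell
    pi_cell = sums / counts                  # empirical frequency on each cell
    # The logit is -inf (resp. +inf) on a cell containing only zeros (resp. ones).
    with np.errstate(divide="ignore"):
        f_cell = np.log(pi_cell) - np.log1p(-pi_cell)
    return pi_cell[labels], f_cell[labels]
```

Cells on which all responses are equal give an infinite fitted value, which is the degenerate situation discussed later for the Kullback-Leibler divergence.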
4.3. First bounds on $\hat f_m$. Consider the following assumption:

(A2) There exists a constant $\rho > 0$ such that $\min_{i=1,\dots,n} \pi_{f_0}(x_i) \ge \rho$ and $\min_{i=1,\dots,n} [1 - \pi_{f_0}(x_i)] \ge \rho$.

Proposition 4.1. Consider Model (2.1) and let $\hat f_m$ be defined by (4.10) with $m$ such that for all
$J \in m$, $|J| \ge \Gamma [\log(n)]^2$ for a positive constant $\Gamma$. Under Assumption (A2), for all $\delta > 0$ and
$a > 1$, we have


E/.wpw.pi')] < ■A-(pS;’,p«)) + T/p- 


j(«) 


m k(Y,p,6) 


4.4. Adaptive estimation and oracle inequality. The following result provides an adaptive
estimation of $f_0$ and a risk bound for the selected model.

Definition 4.1. Let $\mathcal{M}$ be a collection of partitions of $\mathcal{X} = \{1, \dots, n\}$ constructed on the partition
$m_f$, i.e. $m_f$ is a refinement of every $m \in \mathcal{M}$.

In other words, a partition $m$ belongs to $\mathcal{M}$ if any element of $m$ is the union of some elements
of $m_f$. Thus $S_{m_f}$ contains every model of the collection $\{S_m\}_{m \in \mathcal{M}}$.

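Theorem 4.1. Consider Model (2.1) under Assumption (A2). Let $\{S_m, m \in \mathcal{M}\}$ be a collection
of models defined in Section 4.1, where $\mathcal{M}$ is a set of partitions constructed on the partition $m_f$
such that

$$\text{for all } J \in m_f, \quad |J| \ge \Gamma \log^2(n), \tag{4.1}$$

where $\Gamma$ is a positive constant. Let $(L_m)_{m \in \mathcal{M}}$ be some family of positive weights satisfying

$$\Sigma = \sum_{m \in \mathcal{M}} \exp(-L_m D_m) < +\infty. \tag{4.2}$$

Let $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ satisfy, for $m \in \mathcal{M}$ and for $\mu > 1$,

$$\mathrm{pen}(m) \ge \mu \frac{D_m}{n} \left( 1 + 6 L_m + 8 \sqrt{L_m} \right).$$

Let $\tilde f = \hat f_{\hat m}$, where

$$\hat m = \arg\min_{m \in \mathcal{M}} \left\{ \gamma_n(\hat f_m) + \mathrm{pen}(m) \right\};$$

then, for a constant $C_\mu$ depending only on $\mu$, we have

$$\mathbb{E}_{f_0}\left[ h^2\left(P^{(n)}_{f_0}, P^{(n)}_{\tilde f}\right) \right] \le C_\mu \inf_{m \in \mathcal{M}} \left\{ \mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f_m}\right) + \mathrm{pen}(m) \right\} + \frac{C(\rho, \mu, \Gamma, \Sigma)}{n}. \tag{4.3}$$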

This theorem provides a non-asymptotic bound for the Hellinger risk between the selected
model and the true one. In contrast to Theorem 3.1, the penalty function does not depend
on the bound of the true function. The selection procedure, based only on the data, frees
the estimator from any prior knowledge about the smoothness of the function
to estimate. The estimator is therefore adaptive. As we bound the Hellinger risk in (4.3) by the
Kullback-Leibler risk, one would prefer to have the Hellinger risk on the right-hand side
instead of the Kullback-Leibler risk. Such a bound is possible if we assume that $\log(\| \pi_{f_0} / \rho \|_\infty)$ is
bounded. Indeed, if we assume that there exists $T$ such that $\log(\| \pi_{f_0} / \rho \|_\infty) \le T$, this implies that
$\log(\| \pi_{f_0} / \pi_{f_m} \|_\infty) \le T$ uniformly for all partitions $m \in \mathcal{M}$. Now using Inequality (7.6) p. 362 in
Birgé and Massart (1998) we have that $\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f_m}\right) \le (4 + 2 \log(M))\, h^2\left(P^{(n)}_{f_0}, P^{(n)}_{f_m}\right)$, which implies

$$\mathbb{E}_{f_0}\left[ h^2\left(P^{(n)}_{f_0}, P^{(n)}_{\tilde f}\right) \right] \le C_\mu\, C(T) \inf_{m \in \mathcal{M}} \left\{ h^2\left(P^{(n)}_{f_0}, P^{(n)}_{f_m}\right) + \mathrm{pen}(m) \right\} + \frac{C(\rho, \mu, \Gamma, \Sigma)}{n}.$$


Choice of the weights $\{L_m, m \in \mathcal{M}\}$. According to Theorem 4.1, the penalty function depends
on the collection $\mathcal{M}$ through the choice of the weights $L_m$ satisfying (4.2), i.e.

$$\Sigma = \sum_{m \in \mathcal{M}} \exp(-L_m D_m) = \sum_{D \ge 1} e^{-L_D D}\, \mathrm{Card}\{ m \in \mathcal{M}, |m| = D \} < \infty. \tag{4.4}$$

Hence the number of models having the same dimension $D$ plays an important role in the risk
bound.

If there is only one model of dimension $D$, a simple way of choosing the $L_m$ is to take them
constant, i.e. $L_m = L$ for all $m \in \mathcal{M}$, and thus we have from (4.4)

$$\Sigma = \sum_{D \ge 1} e^{-L D} < \infty.$$

This is the case when $\mathcal{M}$ is a family of regular partitions. Consequently, the choice $L_m = L$
for all $m \in \mathcal{M}$ leads to a penalty proportional to the dimension $D_m$, and for every $D_m \ge 1$,

$$\mathrm{pen}(m) = \mu \left( 1 + 6 L + 8 \sqrt{L} \right) \frac{D_m}{n} = c \times \frac{D_m}{n}. \tag{4.5}$$

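In the more general context, that is, in the case of irregular partitions, the number of models
having the same dimension $D$ is exponential and satisfies

$$\mathrm{Card}\{ m \in \mathcal{M}, |m| = D \} = \binom{n-1}{D-1} \le \binom{n}{D}.$$

In that case we choose $L_m$ depending on the dimension $D_m$. With $L$ depending on $D$, $\Sigma$ in (4.2)
satisfies

$$\Sigma = \sum_{D \ge 1} e^{-L_D D}\, \mathrm{Card}\{ m \in \mathcal{M}, |m| = D \} \le \sum_{D \ge 1} e^{-L_D D} \binom{n}{D} \le \sum_{D \ge 1} e^{-L_D D} \left( \frac{e n}{D} \right)^{D} \le \sum_{D \ge 1} e^{-D \left( L_D - 1 - \log\left(\frac{n}{D}\right) \right)}.$$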
So taking $L_D = 2 + \log\left(\frac{n}{D}\right)$ leads to $\Sigma < \infty$ and the penalty becomes

$$\mathrm{pen}(m) = \mu \times \mathrm{pen}_{\mathrm{shape}}(m), \tag{4.6}$$

where

(4.7)


PenshapeC"*) = ^ a/^ + (^)]- 


The constant $\mu$ can be calibrated using the slope heuristics of Birgé and Massart (2007) (see
Section 5.2).
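The choice $L_D = 2 + \log(n/D)$ can be checked numerically with the short sketch below. It assumes, as in the counting argument above, that the collection contains all $\binom{n-1}{D-1}$ partitions of $\{1, \dots, n\}$ into $D$ intervals (the collection of Theorem 4.1 is smaller, so this is a worst case), and the helper names `log_binom` and `weight_sum` are ours. Since each term of the sum is then at most $e^{-D}$, the total stays below $\sum_{D \ge 1} e^{-D} = 1/(e-1)$ whatever $n$.

```python
import math

def log_binom(a, b):
    """Logarithm of the binomial coefficient C(a, b), via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def weight_sum(n):
    """Sum over D of exp(-L_D * D) * C(n-1, D-1) with L_D = 2 + log(n / D)."""
    total = 0.0
    for D in range(1, n + 1):
        L_D = 2.0 + math.log(n / D)
        # Work on the log scale: the binomial coefficient alone would overflow.
        total += math.exp(log_binom(n - 1, D - 1) - L_D * D)
    return total
```

Plugging the same weights into the lower bound of Theorem 4.1 gives $\frac{D}{n}\left(13 + 6\log\frac{n}{D} + 8\sqrt{2 + \log\frac{n}{D}}\right)$ as one admissible penalty shape; this is a direct substitution and is not claimed to be the exact expression used in (4.7).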


Remark 4.1. In Theorem 4.1 we do not assume that the target function $f_0$ is piecewise constant.
However, in many contexts, for instance in segmentation, we might want to consider that $f_0$ is
piecewise constant or can be well approximated by piecewise constant functions. That means
there exists a partition of $\mathcal{X}$ within each element of which the observations follow the same
distribution and between which observations have different distributions.

5. Simulations 

In this section we present numerical simulations to study the non-asymptotic properties of the
model selection procedure introduced in Section 4.4. More precisely, the numerical properties
of the estimators built by model selection with our criteria are compared with those of the
estimators resulting from model selection using the well-known criteria AIC and BIC.


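5.1. Simulations frameworks. We consider the model defined in (2.1) with $f_0 : [0,1] \to \mathbb{R}$.
The aim is to estimate $f_0$. We consider the collection of models $(S_m)_{m \in \mathcal{M}}$, where

$$S_m = \mathrm{Vect}\left\{ \mathbf{1}_{\left[\frac{k-1}{D_m}, \frac{k}{D_m}\right)} \ \text{such that} \ 1 \le k \le D_m \right\},$$

and $\mathcal{M}$ is the collection of regular partitions

$$m = \left\{ \left[ \frac{k-1}{D_m}, \frac{k}{D_m} \right) \ \text{such that} \ 1 \le k \le D_m \right\},$$

where

$$D_m \le \frac{n}{\log n}.$$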

The collection of estimators is defined in Lemma 4.1. Let us thus consider four penalties:

• the AIC criterion, defined by

$$\mathrm{pen}_{\mathrm{AIC}} = \frac{D_m}{n};$$

• the BIC criterion, defined by

$$\mathrm{pen}_{\mathrm{BIC}} = \frac{\log n}{2n} D_m;$$

• the penalty proportional to the dimension as in (4.5), defined by

$$\mathrm{pen}_{\mathrm{lin}} = c \times \frac{D_m}{n};$$

• and the penalty defined in (4.6) by

$$\mathrm{pen} = \mu \times \mathrm{pen}_{\mathrm{shape}}(m).$$
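The penalized selection over regular partitions of $[0,1]$ can be sketched as below. Everything here is illustrative: the function names are ours, the BIC penalty is written on the scale of the normalized contrast $\gamma_n$, and `pen_shape` uses the expression obtained by substituting $L_D = 2 + \log(n/D)$ into the lower bound of Theorem 4.1, which is an assumption and not necessarily the exact formula (4.7). The multiplicative constants $c$ and $\mu$ are left to the caller.

```python
import numpy as np

def pen_aic(D, n):
    return D / n

def pen_bic(D, n):
    return np.log(n) * D / (2.0 * n)

def pen_lin(D, n, c):
    return c * D / n

def pen_shape(D, n):
    # Assumed shape: (D/n) * (1 + 6 L + 8 sqrt(L)) with L = 2 + log(n / D).
    r = np.log(n / D)
    return (D / n) * (13.0 + 6.0 * r + 8.0 * np.sqrt(2.0 + r))

def regressogram_contrast(x, y, D):
    """gamma_n at the regressogram on the regular partition of [0, 1] into D cells."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cells = np.minimum((x * D).astype(int), D - 1)   # x = 1 falls in the last cell
    counts = np.bincount(cells, minlength=D).astype(float)
    ones = np.bincount(cells, weights=y, minlength=D)
    zeros = counts - ones
    loglik = 0.0
    for a in (ones, zeros):
        pos = a > 0                                  # convention 0 * log(0) = 0
        loglik += np.sum(a[pos] * np.log(a[pos] / counts[pos]))
    return -loglik / len(y)

def select_dimension(x, y, penalty, max_dim):
    """Dimension D in 1..max_dim minimizing gamma_n(f_hat_D) + penalty(D, n)."""
    n = len(y)
    crit = [regressogram_contrast(x, y, D) + penalty(D, n)
            for D in range(1, max_dim + 1)]
    return int(np.argmin(crit)) + 1
```

A calibrated penalty can be passed as a closure, for example `select_dimension(x, y, lambda D, n: mu * pen_shape(D, n), max_dim)` for some value `mu` chosen by the user.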

$\mathrm{pen}_{\mathrm{lin}}$ and $\mathrm{pen}$ are penalties depending on some unknown multiplicative constant ($c$ and $\mu$
respectively) to be calibrated. As previously said, we will use the "slope heuristics" introduced
in Birgé and Massart (2007) to calibrate the multiplicative constant. We have distinguished two
cases:

• The case where there exists $m_0 \in \mathcal{M}$ such that the true function belongs to $S_{m_0}$, i.e.
where $f_0$ is piecewise constant:

Mod1: $f_0 = 0.5\, \mathbf{1}_{[0,1/3)} + \mathbf{1}_{[1/3,0.5)} + 2\, \mathbf{1}_{[0.5,2/3)} + 0.25\, \mathbf{1}_{[2/3,1]}$,

Mod2: $f_0 = 0.75\, \mathbf{1}_{[0,1/4]} + 0.5\, \mathbf{1}_{[1/4,0.5)} + 0.2\, \mathbf{1}_{[0.5,3/4)} + 0.3\, \mathbf{1}_{[3/4,1]}$.

• The second case, where $f_0$ does not belong to any $S_m$, $m \in \mathcal{M}$, and is chosen in the following
way:

Mod3: $f_0(x) = \sin(\pi x)$,

Mod4: $f_0(x) = \sqrt{x}$.

In each case, the $x_i$'s are simulated according to the uniform distribution on $[0,1]$.
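The four simulation settings can be reproduced along the following lines. This is a sketch under our own naming (`f0_mod1`, ..., `simulate` are not from the paper), and the treatment of interval endpoints is immaterial because the design points are drawn from a continuous distribution.

```python
import numpy as np

def f0_mod1(x):
    x = np.asarray(x, dtype=float)
    return np.select([x < 1/3, x < 0.5, x < 2/3], [0.5, 1.0, 2.0], default=0.25)

def f0_mod2(x):
    x = np.asarray(x, dtype=float)
    return np.select([x < 0.25, x < 0.5, x < 0.75], [0.75, 0.5, 0.2], default=0.3)

def f0_mod3(x):
    return np.sin(np.pi * np.asarray(x, dtype=float))

def f0_mod4(x):
    return np.sqrt(np.asarray(x, dtype=float))

def simulate(f0, n, rng):
    """Draw x_i uniformly on [0, 1] and Y_i ~ Bernoulli(pi_{f0}(x_i))."""
    x = rng.uniform(0.0, 1.0, size=n)
    p = 1.0 / (1.0 + np.exp(-f0(x)))     # logistic link, as in (2.1)
    y = rng.binomial(1, p)
    return x, y
```

For instance, `simulate(f0_mod3, 500, np.random.default_rng(0))` returns one dataset of size 500 under Mod3; the seed is an arbitrary choice.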

The Kullback-Leibler divergence is definitely not suitable to evaluate the quality of an
estimator. Indeed, given a model $S_m$, there is a positive probability that on one of the intervals $I \in m$
we have $\pi_{\hat f_m}^{(I)} = 0$ or $\pi_{\hat f_m}^{(I)} = 1$, which implies that $\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_m}\right) = +\infty$. So we will use the Hellinger
distance to evaluate the quality of an estimator.

Even if an oracle inequality seems of no practical use, it can serve as a benchmark to evaluate
the performance of any data-driven selection procedure. Thus the model selection performance of
each procedure is evaluated by the following benchmark


(5.8) 



$C^*$ evaluates how far the selected estimator is from the oracle. The values of $C^*$ evaluated for
each procedure with different sample sizes $n \in \{100, 200, \dots, 1000\}$ are reported in Figure 2,
Figure 3, Figure 4 and Figure 5. For each sample size $n \in \{100, 200, \dots, 1000\}$, the expectation
was estimated using the mean over 1000 simulated datasets.
5.2. Slope heuristics. The aim of this section is to show how the penalty in Theorem 4.1 can
be calibrated in practice using the main ideas of the data-driven penalized model selection criterion
proposed by Birgé and Massart (2007). We calibrate the penalty using the "slope heuristics", first
introduced and theoretically validated by Birgé and Massart (2007) in a Gaussian homoscedastic
setting. Recently it has also been theoretically validated in the heteroscedastic random-design case
by Arlot (2009) and for least squares density estimation by Lerasle (2012). Several encouraging
applications of this method have been developed in many other frameworks (see for instance, in
clustering and variable selection for categorical multivariate data, Bontemps and Toussile (2013); for
variable selection and clustering via Gaussian mixtures, Maugis and Michel (2011); in multiple
change-point detection, Lebarbier (2005)). An overview and an implementation of the slope
heuristics can be found in Baudry et al. (2012).

We now describe the main idea of those heuristics, starting from the main goal of model
selection, that is, to choose the best estimator of $f_0$ among a collection of estimators $\{\hat f_m\}_{m \in \mathcal{M}}$.
Moreover, we expect that this best estimator mimics the so-called oracle defined in (2.7). To
this aim, the great challenge is to build a penalty function such that the selected model $\hat m$ is
nearly as good as the oracle. In the following we call ideal penalty the penalty that leads to
the choice of $m^*$. Using that

$$\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{\hat f_m}\right) = \gamma(\hat f_m) - \gamma(f_0),$$

then, by definition, $m^*$ defined in (2.7) satisfies

$$m^* = \arg\min_{m \in \mathcal{M}} \left[ \gamma(\hat f_m) - \gamma(f_0) \right] = \arg\min_{m \in \mathcal{M}} \gamma(\hat f_m).$$
meM 


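The ideal penalty, leading to the choice of the oracle $m^*$, is thus $\mathrm{pen}_{\mathrm{id}}(m) = \gamma(\hat f_m) - \gamma_n(\hat f_m)$, for $m \in \mathcal{M}$.
As a matter of fact, by replacing $\mathrm{pen}_{\mathrm{id}}(m)$ by its value, we obtain

$$\arg\min_{m \in \mathcal{M}} \left[ \gamma_n(\hat f_m) + \mathrm{pen}_{\mathrm{id}}(m) \right] = \arg\min_{m \in \mathcal{M}} \left[ \gamma_n(\hat f_m) + \gamma(\hat f_m) - \gamma_n(\hat f_m) \right] = \arg\min_{m \in \mathcal{M}} \gamma(\hat f_m) = m^*.$$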


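Of course this ideal penalty always selects the oracle model, but it depends on the unknown
function $f_0$ through the sample distribution, since $\gamma(t) = \mathbb{E}_{f_0}[\gamma_n(t)]$. A natural idea is to choose
$\mathrm{pen}(m)$ as close as possible to $\mathrm{pen}_{\mathrm{id}}(m)$ for every $m \in \mathcal{M}$. Now, we use that this ideal penalty
can be decomposed into

$$\mathrm{pen}_{\mathrm{id}}(m) = \gamma(\hat f_m) - \gamma_n(\hat f_m) = v_m + \hat v_m + e_m,$$

where

$$v_m = \gamma(\hat f_m) - \gamma(f_m), \quad \hat v_m = \gamma_n(f_m) - \gamma_n(\hat f_m), \quad \text{and} \quad e_m = \gamma(f_m) - \gamma_n(f_m).$$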
The slope heuristics relies on two points:

• The existence of a minimal penalty $\mathrm{pen}_{\min}(m) = \hat v_m$ such that, when the penalty is
smaller than $\mathrm{pen}_{\min}$, the selected model is one of the most complex models, whereas
penalties larger than $\mathrm{pen}_{\min}$ lead to a selection of models with "reasonable" complexity.

• Using concentration arguments, it is reasonable to consider that, uniformly over $\mathcal{M}$,
$\gamma_n(f_m)$ is close to its expectation, which implies that $e_m \approx 0$. In the same way, since
$\hat v_m$ is an empirical version of $v_m$, it is also reasonable to consider that $v_m \approx \hat v_m$. The ideal
penalty is thus approximately given by $2 \hat v_m$, and thus

$$\mathrm{pen}_{\mathrm{id}}(m) \approx 2\, \mathrm{pen}_{\min}(m).$$

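In practice, $\hat v_m$ can be estimated from the data provided that the ideal penalty $\mathrm{pen}_{\mathrm{id}}(\cdot) = \kappa_{\mathrm{id}}\, \mathrm{pen}_{\mathrm{shape}}(\cdot)$
is known up to a multiplicative factor. A major point of the slope heuristics is that

$$\frac{\kappa_{\mathrm{id}}}{2}\, \mathrm{pen}_{\mathrm{shape}}(\cdot)$$

is a good estimator of $\hat v_m$, and this provides the minimal penalty.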

Provided that $\mathrm{pen} = \kappa \times \mathrm{pen}_{\mathrm{shape}}$ is known up to a multiplicative constant $\kappa$ that is to be
calibrated, we combine the previous heuristic with the method usually known as the dimension
jump method. In practice, we consider a grid $\kappa_1, \dots, \kappa_M$, where each $\kappa_j$ leads to a selected
model $\hat m_{\kappa_j}$ with dimension $D_{\hat m_{\kappa_j}}$. The constant $\kappa_{\min}$, which corresponds to the value such that
$\mathrm{pen}_{\min} = \kappa_{\min} \times \mathrm{pen}_{\mathrm{shape}}$, is estimated using the first point of the "slope heuristics". If $D_{\hat m_{\kappa_j}}$ is
plotted as a function of $\kappa_j$, $\kappa_{\min}$ is such that $D_{\hat m_{\kappa_j}}$ is "huge" for $\kappa < \kappa_{\min}$ and "reasonably small"
for $\kappa > \kappa_{\min}$. So $\kappa_{\min}$ is the value at the position of the biggest jump. For more details about this
method we refer the reader to Baudry et al. (2012) and Arlot and Massart (2009).
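A minimal sketch of the dimension jump calibration described above is given below. It is not the authors' implementation: the function name `dimension_jump`, its arguments, and the convention of taking the grid value just after the largest drop as the estimate of $\kappa_{\min}$ are our own assumptions. The final model is selected with the penalty $2\hat\kappa_{\min} \times \mathrm{pen}_{\mathrm{shape}}$, following the relation $\mathrm{pen}_{\mathrm{id}} \approx 2\,\mathrm{pen}_{\min}$.

```python
import numpy as np

def dimension_jump(contrasts, dims, shape, kappas):
    """Calibrate the penalty constant by the dimension jump method.

    contrasts[k] : gamma_n(f_hat_m) for the k-th model
    dims[k]      : dimension D_m of the k-th model
    shape[k]     : pen_shape(m) for the k-th model
    kappas       : increasing grid of candidate constants
    Returns (kappa_min_hat, index of the finally selected model).
    """
    contrasts = np.asarray(contrasts, dtype=float)
    dims = np.asarray(dims)
    shape = np.asarray(shape, dtype=float)
    kappas = np.asarray(kappas, dtype=float)
    # Model selected for each kappa on the grid.
    crit = contrasts[None, :] + kappas[:, None] * shape[None, :]
    selected_dims = dims[np.argmin(crit, axis=1)]
    # Largest drop of the selected dimension between consecutive grid points.
    drops = selected_dims[:-1] - selected_dims[1:]
    kappa_min = float(kappas[int(np.argmax(drops)) + 1])
    # Final selection with twice the minimal penalty.
    final = int(np.argmin(contrasts + 2.0 * kappa_min * shape))
    return kappa_min, final
```

The resolution of the grid limits the precision of the estimated constant, so in practice a fine grid over a wide range is preferable.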

Figures 2 and 3 are the cases where the true function is piecewise constant. Figure 4 and
Figure 5 are situations where the true function does not belong to any model in the given collection.
The performance of the criteria depends on the sample size $n$. In these two situations we observe
that our two model selection procedures are comparable, and their performance increases with
$n$, while the performance of the model selected by BIC decreases with $n$. Our criteria outperform
AIC for all $n$. The BIC criterion is better than our criteria for $n < 200$. For $200 < n < 400$,
the performance of the model selected by BIC is about the same as the performance of the models
selected by our criteria. Finally, for $n > 400$ our criteria outperform BIC.

Theoretical results and simulations raise the following question: why are our criteria better
than BIC for quite large values of $n$, although the theoretical results are non-asymptotic? To answer
this question we can say that, in the simulations, we calibrated our penalties using the "slope
heuristics", and those heuristics are based on asymptotic arguments (see Section 5.2).


6. Proofs 


6.1. Notations and technical tools. Subsequently we will use the following notations. Denote
by $\| f \|_n$ and $\langle f, g \rangle_n$ the empirical Euclidean norm and inner product:

$$\| f \|_n^2 = \frac{1}{n} \sum_{i=1}^{n} f^2(x_i), \quad \text{and} \quad \langle f, g \rangle_n = \frac{1}{n} \sum_{i=1}^{n} f(x_i) g(x_i).$$
Figure 2. Model selection performance (C*) as a function of sample size n,
with each penalty, Mod1.


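Note that $\| \cdot \|_n$ is a semi-norm on the space $\mathcal{F}$ of functions $g : \mathcal{X} \to \mathbb{R}$, but is a norm on the
quotient space $\mathcal{F} / \mathcal{R}$ associated to the equivalence relation $\mathcal{R}$: $g \mathcal{R} h$ if and only if $g(x_i) = h(x_i)$
for all $i \in \{1, \dots, n\}$. It follows from (2.3) that $\gamma$ defined in (2.4) can be expressed as the sum
of a centered empirical process and of the estimation criterion $\gamma_n$. More precisely, denoting by
$\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top$, with $\varepsilon_i = Y_i - \mathbb{E}_{f_0}(Y_i)$ for all $i$, we have

$$\gamma(f) = \gamma_n(f) + \langle \varepsilon, f \rangle_n. \tag{6.1}$$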
Figure 3. Model selection performance (C*) as a function of sample size n,
with each penalty, Mod2.


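Easy calculations show that for $\gamma$ defined in (2.4) we have

$$\mathcal{K}\left(P^{(n)}_{f_0}, P^{(n)}_{f}\right) = \frac{1}{n} \int \log\left(\frac{dP^{(n)}_{f_0}}{dP^{(n)}_{f}}\right) dP^{(n)}_{f_0} = \frac{1}{n} \sum_{i=1}^{n} \left[ \pi_{f_0}(x_i) \log\left(\frac{\pi_{f_0}(x_i)}{\pi_f(x_i)}\right) + (1 - \pi_{f_0}(x_i)) \log\left(\frac{1 - \pi_{f_0}(x_i)}{1 - \pi_f(x_i)}\right) \right] = \gamma(f) - \gamma(f_0).$$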


Let us recall the usual bounds (see Castellan (2003b)) for the Kullback-Leibler information:

Lemma 6.1. For positive densities $p$ and $q$ with respect to $\mu$, if $f = \log(q/p)$, then

$$\frac{1}{2} \int f^2 (1 \wedge e^f)\, p\, d\mu \le \mathcal{K}(p, q) \le \frac{1}{2} \int f^2 (1 \vee e^f)\, p\, d\mu.$$
Figure 4. Model selection performance (C*) as a function of sample size n,
with each penalty, Mod3.


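6.2. Proof of Proposition 3.1. By definition of $\hat f_m$, for all $f \in S_m \cap L_\infty(C_0)$, $\gamma_n(\hat f_m) - \gamma_n(f) \le 0$.
We apply (6.1) with $f = \hat f_m$ and $f = f_m$:

$$\gamma(\hat f_m) - \gamma(f_0) \le \gamma(f_m) - \gamma(f_0) + \langle \varepsilon, \hat f_m - f_m \rangle_n.$$

As usual, the main part of the proof relies on the study of the empirical process $\langle \varepsilon, \hat f_m - f_m \rangle_n$.
Since $\hat f_m - f_m$ belongs to $S_m$, we can write $\hat f_m - f_m = \sum_{j=1}^{D_m} \alpha_j \psi_j$, where $\{\psi_1, \dots, \psi_{D_m}\}$ is an orthonormal basis
of $S_m$, and consequently

$$\langle \varepsilon, \hat f_m - f_m \rangle_n = \sum_{j=1}^{D_m} \alpha_j \langle \varepsilon, \psi_j \rangle_n.$$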
Figure 5. Model selection performance (C*) as a function of sample size n,
with each penalty, Mod4.


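Applying Cauchy-Schwarz inequality we get, with $(\alpha_j)$ the coefficients of $\hat f_m-f_m$ in the orthonormal basis $(\psi_j)$ of $S_m$,

$$\langle\varepsilon,\hat f_m-f_m\rangle_n\le\Big(\sum_{j=1}^{D_m}\alpha_j^2\Big)^{1/2}\Big(\sum_{j=1}^{D_m}\langle\varepsilon,\psi_j\rangle_n^2\Big)^{1/2}=\|\hat f_m-f_m\|_n\Big(\sum_{j=1}^{D_m}\langle\varepsilon,\psi_j\rangle_n^2\Big)^{1/2}.$$

We now apply Lemma 6.2 (see Section 7.2 for the proof of Lemma 6.2).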

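Lemma 6.2. Let $S_m$ be the model defined in (2.5) and $\{\psi_1,\dots,\psi_{D_m}\}$ an orthonormal basis of the linear span $\{\phi_k,\ k\in m\}$. We also denote by $\Lambda_m$ the set of $\beta=(\beta_1,\dots,\beta_{D_m})$ such that $f_\beta(\cdot)=\sum_{j=1}^{D_m}\beta_j\psi_j(\cdot)$ satisfies $f_\beta\in S_m\cap L_\infty(C_0)$. Let $\beta^*$ be any minimizer of the function $\beta\mapsto\gamma(f_\beta)$ over $\Lambda_m$. Then, for all $\beta\in\Lambda_m$, we have

$$(6.2)\qquad \frac{U_0^2}{2}\|f_\beta-f_{\beta^*}\|_n^2\le\gamma(f_\beta)-\gamma(f_{\beta^*}),$$

where $U_0>0$ is a constant depending only on $C_0$.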


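Then we have

$$\langle\varepsilon,\hat f_m-f_m\rangle_n\le\frac{\sqrt2}{U_0}\sqrt{\sum_{j=1}^{D_m}\langle\varepsilon,\psi_j\rangle_n^2}\ \sqrt{\gamma(\hat f_m)-\gamma(f_m)}.$$

Now we use that for all positive numbers $a$, $b$, $x$, $ab\le(x/2)a^2+[1/(2x)]b^2$, and infer that

$$\gamma(\hat f_m)-\gamma(f_0)\le\gamma(f_m)-\gamma(f_0)+\frac{x}{U_0^2}\sum_{j=1}^{D_m}\langle\varepsilon,\psi_j\rangle_n^2+\frac1{2x}\big(\gamma(\hat f_m)-\gamma(f_m)\big).$$

For $x>1/2$, it follows that

$$\mathbb E_{f_0}[\gamma(\hat f_m)-\gamma(f_0)]\le\gamma(f_m)-\gamma(f_0)+\frac{2x^2}{(2x-1)U_0^2}\,\mathbb E_{f_0}\Big[\sum_{j=1}^{D_m}\langle\varepsilon,\psi_j\rangle_n^2\Big].$$

We conclude the proof by using that

$$\mathbb E_{f_0}\Big[\sum_{j=1}^{D_m}\langle\varepsilon,\psi_j\rangle_n^2\Big]\le\frac{D_m}{4n}.$$

□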


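6.3. Proof of Theorem 3.1. By definition, for all $m\in\mathcal M$,

$$\gamma_n(\hat f_{\hat m})+\mathrm{pen}(\hat m)\le\gamma_n(\hat f_m)+\mathrm{pen}(m)\le\gamma_n(f_m)+\mathrm{pen}(m).$$

Applying (6.1) we have

$$(6.3)\qquad \mathcal K(P^{(n)}_{f_0},P^{(n)}_{\hat f_{\hat m}})\le\mathcal K(P^{(n)}_{f_0},P^{(n)}_{f_m})+\langle\varepsilon,\hat f_{\hat m}-f_m\rangle_n+\mathrm{pen}(m)-\mathrm{pen}(\hat m).$$

It remains to study $\langle\varepsilon,\hat f_{\hat m}-f_m\rangle_n$, using the following lemma, which is a modification of Lemma 1 in Durot et al. (2009).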


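Lemma 6.3. For every $D$, $D'$ and $x>0$ we have

$$\mathbb P\left(\sup_{u\in(S_D\cap L_\infty(C_0)+S_{D'}\cap L_\infty(C_0))}\frac{\langle\varepsilon,u\rangle_n}{\|u\|_n}>\sqrt{\frac{D+D'}{4n}}+\sqrt{\frac{5x}{n}}\right)\le\exp(-x).$$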
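Fix $\xi>0$ and let $\Omega_\xi(m)$ denote the event

$$\Omega_\xi(m)=\bigcap_{m'\in\mathcal M}\left\{\sup_{u\in(S_m\cap L_\infty(C_0)+S_{m'}\cap L_\infty(C_0))}\frac{\langle\varepsilon,u\rangle_n}{\|u\|_n}\le\sqrt{\frac{D_m+D_{m'}}{4n}}+\sqrt{\frac{5(L_{m'}D_{m'}+\xi)}{n}}\right\}.$$

Then we have

$$(6.4)\qquad \mathbb P(\Omega_\xi(m))\ge1-\Sigma\exp(-\xi).$$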


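See the Appendix for the proof of this lemma. Fix $\xi>0$; applying Lemma 6.3 we infer that on the event $\Omega_\xi(m)$,

$$\langle\varepsilon,\hat f_{\hat m}-f_m\rangle_n\le\left(\sqrt{\frac{D_m+D_{\hat m}}{4n}}+\sqrt{\frac{5(L_{\hat m}D_{\hat m}+\xi)}{n}}\right)\|\hat f_{\hat m}-f_m\|_n$$

$$\le\left(\sqrt{\frac{D_m}{4n}}+\sqrt{D_{\hat m}}\Big(\frac1{\sqrt{4n}}+\sqrt{\frac{5L_{\hat m}}{n}}\Big)+\sqrt{\frac{5\xi}{n}}\right)\big(\|\hat f_{\hat m}-f_0\|_n+\|f_0-f_m\|_n\big).$$

Applying that $2xy\le\theta x^2+\theta^{-1}y^2$ for all $x>0$, $y>0$, $\theta>0$, we get that on $\Omega_\xi(m)$ and for every $\eta\in\,]0,1[$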


-I- 




(Sjm - fm)n < (^^) [(^ + d) II fm “ /o 11^ +(1 +d II /o “ fm 11^] 

1 


2 ( 1 - 77 ) 

1 - ?7^ 


(l + i/)A? 


V477 


-P 


5 Ln 


a ,d ^ - d 


+ a+r]~l 


I + T] 


A 

477 


-P 



5^ 


2n 


II fm - fo \\n II fo - fm 11^ 


2 ■’ ■' " 2 
1 + 77 “^^ An , 5^^ 


2(1-77) 


1 ^ / SLfii 


V 477 


L -r '/ / ... 

+- —\ — + — 

1 - 77 ( 


T] ^ An n 

\2 ■ 


If pen(777) > (dAn + V5A) )/f^, with d > 0, we have 


fh fm')n ^ 


^-d\, ^ ^ l|2 A ' “ ^ 11 7- 7- I|2 , 1+^7 _, 1 + ^ 


-1 


II A-/oll^+- 


II fo - fm ll„ + 


2 ( 1 - 77)4 


pen(7w) -P 


(l-77)d 


pen(m) 


-P 


1 + 77-1 5^ 


1 - 77 77 
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It follows from (|6.3|) that 

3 O) p(”, 

■ /o ’ fm 




+ - 


/o ’ U 

1+77 


2 ( 1 - 77 M" (I-77M 

Taking A = {rj + 1)/(2(1 - 77 )), we have 


^-Wfo-fmt 

1 + 77“^ 1 + 77“^ 5 ^ 

pen(777) + —-pen(m) + —-h pen(7n) - pen(777). 


1 - 77 n 


/O JO Jm 


+ 


AA 


(2A + 1)2 


fm - fo \\l + 


4A 


4^2-1 


fo - fm \\n + 


6A+ 1 
2T- 1 


pen(777) + 


10T(2T+ 1)^ 
2A- 1 n' 


Now we use the following lemma (see Lemma 6.1 in Kwemou (2012)), which connects the empirical norm and the Kullback-Leibler divergence.


Lemma 6.4. Under Assumption ($A_1$), for all $m\in\mathcal M$ and all $t\in S_m\cap L_\infty(C_0)$, we have

$$c_{\min}\|t-f_0\|_n^2\le\mathcal K(P^{(n)}_{f_0},P^{(n)}_t)\le c_{\max}\|t-f_0\|_n^2,$$
where $c_{\min}$ and $c_{\max}$ are constants depending on $C_0$ and $c_1$.
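The equivalence between the Kullback-Leibler divergence and the squared empirical norm can be illustrated numerically. The sketch below does not use the paper's constants: it assumes that both $t$ and $f_0$ are bounded by a hypothetical constant `C`, in which case a pointwise second-order Taylor expansion gives the admissible pair `c_min = e^C / (2 (1 + e^C)^2)` and `c_max = 1/8`.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kl_logistic(f0_vals, f_vals):
    """Normalized KL divergence between the logistic models at f0 and f."""
    p0, p = sigmoid(f0_vals), sigmoid(f_vals)
    return float(np.mean(p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))))

C = 2.0                                            # hypothetical sup-norm bound
c_min = np.exp(C) / (2.0 * (1.0 + np.exp(C))**2)   # illustrative lower constant
c_max = 1.0 / 8.0                                  # illustrative upper constant

rng = np.random.default_rng(1)
f0_vals = rng.uniform(-C, C, size=100)   # values of f0 at the design points
t_vals = rng.uniform(-C, C, size=100)    # values of t at the design points

kl = kl_logistic(f0_vals, t_vals)
sq_norm = float(np.mean((t_vals - f0_vals)**2))
print(c_min * sq_norm, kl, c_max * sq_norm)
```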

Consequently 


< ClCmin) + pen(m)) + 


where 


1 + 


42 


C(c,„,„) = max - 


(422-1)c„ 


62+1 

22-1 


42 


42 


Cm7>7(2/i+ 1) 


Thus we take A sueh that 
(6.5) 


C7jj/;7(2/i+ !)“■ 

4d 

Cmm(2‘A + 1)2 


and C 


> 0 , 


102 ( 22 + 1 ) 

22-1 

_ 42 

f^07/H(22+ 1)^ 


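where $c_{\min}$ depends on the bound of the true function $f_0$. By definition of $\Omega_\xi(m)$ and (6.4), there exists a random variable $V\ge0$ with $\mathbb P(V>\xi)\le\Sigma\exp(-\xi)$ and $\mathbb E_{f_0}(V)\le\Sigma$ such that

$$\mathcal K(P^{(n)}_{f_0},P^{(n)}_{\hat f_{\hat m}})\le C(c_{\min})\big(\mathcal K(P^{(n)}_{f_0},P^{(n)}_{f_m})+\mathrm{pen}(m)\big)+C_1(c_{\min})\frac Vn,$$

which implies that for all $m\in\mathcal M$,

$$\mathbb E_{f_0}\big[\mathcal K(P^{(n)}_{f_0},P^{(n)}_{\hat f_{\hat m}})\big]\le C(c_{\min})\big(\mathcal K(P^{(n)}_{f_0},P^{(n)}_{f_m})+\mathrm{pen}(m)\big)+C_1(c_{\min})\frac{\Sigma}{n}.$$


This concludes the proof.


□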
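6.4. Proof of Proposition 4.1. Let $f_m$, $\hat f_m$, $\pi^{(J)}_{f_m}$ and $\pi^{(J)}_{\hat f_m}$ be given in Lemma 4.1, proved in the appendix. In the following, $D_m=|m|$. For $\delta>0$, let $\Omega_m(\delta)$ be the event

$$(6.6)\qquad \Omega_m(\delta)=\bigcap_{J\in m}\left\{\left|\frac{\pi^{(J)}_{\hat f_m}}{\pi^{(J)}_{f_m}}-1\right|\le\delta\right\}\cap\left\{\left|\frac{1-\pi^{(J)}_{\hat f_m}}{1-\pi^{(J)}_{f_m}}-1\right|\le\delta\right\}.$$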

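According to Pythagore's type identity and Lemma 4.1 we write

$$\mathcal K(P^{(n)}_{f_0},P^{(n)}_{\hat f_m})=\mathcal K(P^{(n)}_{f_0},P^{(n)}_{f_m})+\mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})\mathbb 1_{\Omega_m(\delta)}+\mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})\mathbb 1_{\Omega^c_m(\delta)},$$

where

$$(6.7)\qquad \mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})=\frac1n\sum_{i=1}^n\left[\pi_{f_m}(x_i)\log\frac{\pi_{f_m}(x_i)}{\pi_{\hat f_m}(x_i)}+(1-\pi_{f_m}(x_i))\log\frac{1-\pi_{f_m}(x_i)}{1-\pi_{\hat f_m}(x_i)}\right]$$

$$=\frac1n\sum_{J\in m}|J|\left[\pi^{(J)}_{f_m}\log\frac{\pi^{(J)}_{f_m}}{\pi^{(J)}_{\hat f_m}}+(1-\pi^{(J)}_{f_m})\log\frac{1-\pi^{(J)}_{f_m}}{1-\pi^{(J)}_{\hat f_m}}\right].$$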

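The first step consists in showing that

$$(6.8)\qquad \frac{1-\delta}{2(1+\delta)^2}X_m^2\,\mathbb 1_{\Omega_m(\delta)}\le\mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})\mathbb 1_{\Omega_m(\delta)}\le\frac{1+\delta}{2(1-\delta)^2}X_m^2\,\mathbb 1_{\Omega_m(\delta)},$$

where

$$(6.9)\qquad X_m^2=\frac1n\sum_{J\in m}\frac{\big(\sum_{k\in J}\varepsilon_k\big)^2}{|J|\,\pi^{(J)}_{f_m}[1-\pi^{(J)}_{f_m}]},\quad\text{with}\quad \frac{4\rho^2D_m}{n}\le\mathbb E_{f_0}[X_m^2]\le\frac{2D_m}{n}.$$

The second step relies on the proof of

$$(6.10)\qquad \big|\mathbb E_{f_0}\big[\mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})\mathbb 1_{\Omega^c_m(\delta)}\big]\big|\le2\log\Big(\frac1\rho\Big)\mathbb P[\Omega^c_m(\delta)].$$

The last step consists in showing that for $\epsilon>0$, since for all $J\in m$, $|J|\ge\Gamma[\log(n)]^2$, where $\Gamma>0$ is an absolute constant, then we have

$$(6.11)\qquad \mathbb P[\Omega^c_m(\delta)]\le4|m|\exp\left(-\frac{\delta^2}{2(1+\delta/3)}\rho^2\Gamma[\log(n)]^2\right)\le\frac{\kappa(\rho,\delta,\Gamma,\epsilon)}{n^{(1+\epsilon)}}.$$

Gathering (6.8)-(6.11), we conclude that

$$\mathbb E_{f_0}\big[\mathcal K(P^{(n)}_{f_0},P^{(n)}_{\hat f_m})\big]\le\mathcal K(P^{(n)}_{f_0},P^{(n)}_{f_m})+\frac{(1+\delta)D_m}{(1-\delta)^2n}+2\log\Big(\frac1\rho\Big)\frac{\kappa(\rho,\delta,\Gamma,\epsilon)}{n^{(1+\epsilon)}}.$$

We finish by proving (6.8), (6.9), (6.10) and (6.11).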
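Proof of (6.8) and (6.9): Arguing as in Castellan (2003b) and using Lemma 6.1 we have

$$\mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})\ge\frac1{2n}\sum_{J\in m}|J|\left[\pi^{(J)}_{f_m}\left(1\wedge\frac{\pi^{(J)}_{\hat f_m}}{\pi^{(J)}_{f_m}}\right)\log^2\left(\frac{\pi^{(J)}_{\hat f_m}}{\pi^{(J)}_{f_m}}\right)+(1-\pi^{(J)}_{f_m})\left(1\wedge\frac{1-\pi^{(J)}_{\hat f_m}}{1-\pi^{(J)}_{f_m}}\right)\log^2\left(\frac{1-\pi^{(J)}_{\hat f_m}}{1-\pi^{(J)}_{f_m}}\right)\right]$$

and

$$\mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})\le\frac1{2n}\sum_{J\in m}|J|\left[\pi^{(J)}_{f_m}\left(1\vee\frac{\pi^{(J)}_{\hat f_m}}{\pi^{(J)}_{f_m}}\right)\log^2\left(\frac{\pi^{(J)}_{\hat f_m}}{\pi^{(J)}_{f_m}}\right)+(1-\pi^{(J)}_{f_m})\left(1\vee\frac{1-\pi^{(J)}_{\hat f_m}}{1-\pi^{(J)}_{f_m}}\right)\log^2\left(\frac{1-\pi^{(J)}_{\hat f_m}}{1-\pi^{(J)}_{f_m}}\right)\right].$$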
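It follows that

$$(6.12)\qquad \frac{1-\delta}{2}V^2(\pi_{\hat f_m},\pi_{f_m})\mathbb 1_{\Omega_m(\delta)}\le\mathcal K(P^{(n)}_{f_m},P^{(n)}_{\hat f_m})\mathbb 1_{\Omega_m(\delta)}\le\frac{1+\delta}{2}V^2(\pi_{\hat f_m},\pi_{f_m})\mathbb 1_{\Omega_m(\delta)},$$

where $V^2(\pi_{\hat f_m},\pi_{f_m})$ is defined by

$$(6.13)\qquad V^2(\pi_{\hat f_m},\pi_{f_m})=\frac1n\sum_{J\in m}|J|\frac{\big[\pi^{(J)}_{\hat f_m}-\pi^{(J)}_{f_m}\big]^2}{\pi^{(J)}_{f_m}}\left(\frac{\log\big[\pi^{(J)}_{\hat f_m}/\pi^{(J)}_{f_m}\big]}{\pi^{(J)}_{\hat f_m}/\pi^{(J)}_{f_m}-1}\right)^2+\frac1n\sum_{J\in m}|J|\frac{\big[\pi^{(J)}_{\hat f_m}-\pi^{(J)}_{f_m}\big]^2}{1-\pi^{(J)}_{f_m}}\left(\frac{\log\big[(1-\pi^{(J)}_{\hat f_m})/(1-\pi^{(J)}_{f_m})\big]}{(1-\pi^{(J)}_{\hat f_m})/(1-\pi^{(J)}_{f_m})-1}\right)^2.$$

Now we use that, for all $x>0$,

$$(6.14)\qquad \frac1{1\vee x}\le\frac{\log(x)}{x-1}\le\frac1{1\wedge x}.$$

Hence we infer that

$$\frac1{(1+\delta)^2}X_m^2\,\mathbb 1_{\Omega_m(\delta)}\le V^2(\pi_{\hat f_m},\pi_{f_m})\mathbb 1_{\Omega_m(\delta)}\le\frac1{(1-\delta)^2}X_m^2\,\mathbb 1_{\Omega_m(\delta)},$$

with $X_m^2$ defined in (6.9). This entails that (6.8) is proved. It remains now to check that

$$\frac{4\rho^2|m|}{n}\le\mathbb E_{f_0}[X_m^2]\le\frac{2|m|}{n}.$$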

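According to Lemma 4.1, for all partition $m$, all $J\in m$ and any $x_i\in J$,

$$\pi_{\hat f_m}(x_i)=\pi^{(J)}_{\hat f_m},\quad\text{with}\quad \pi^{(J)}_{\hat f_m}=\frac1{|J|}\sum_{i\in J}Y_i,$$

and

$$\pi_{f_m}(x_i)=\pi^{(J)}_{f_m},\quad\text{with}\quad \pi^{(J)}_{f_m}=\frac1{|J|}\sum_{i\in J}\pi_{f_0}(x_i).$$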
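Consequently,

$$X_m^2=\frac1n\sum_{J\in m}\frac{\big(\sum_{k\in J}\varepsilon_k\big)^2}{|J|\,\pi^{(J)}_{f_m}[1-\pi^{(J)}_{f_m}]},$$

and finally

$$\mathbb E_{f_0}[X_m^2]=\frac1n\sum_{J\in m}\frac1{|J|\,\pi^{(J)}_{f_m}[1-\pi^{(J)}_{f_m}]}\sum_{k\in J}\mathrm{Var}(Y_k)=\frac1n\sum_{J\in m}\frac{\sum_{i\in J}\pi_{f_0}(x_i)(1-\pi_{f_0}(x_i))}{|J|\,\pi^{(J)}_{f_m}[1-\pi^{(J)}_{f_m}]}.$$

Now, according to Assumption ($A_2$) and Lemma 4.1, for all partition $m$, all $J\in m$, and all $x_i\in J$,

$$0<\rho^2\le\pi_{f_0}(x_i)(1-\pi_{f_0}(x_i))\le1/4,\quad 0<\rho\le\pi^{(J)}_{f_m}\quad\text{and}\quad 0<\rho\le(1-\pi^{(J)}_{f_m}).$$

It follows that

$$4\rho^2\le\frac{\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))}{|J|\,\pi^{(J)}_{f_m}[1-\pi^{(J)}_{f_m}]}=\frac{\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))}{|J|\,\pi^{(J)}_{f_m}}+\frac{\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))}{|J|\,(1-\pi^{(J)}_{f_m})}\le2,$$

and thus

$$\frac{4\rho^2|m|}{n}\le\frac1n\sum_{J\in m}\frac{\sum_{i\in J}\pi_{f_0}(x_i)(1-\pi_{f_0}(x_i))}{|J|\,\pi^{(J)}_{f_m}[1-\pi^{(J)}_{f_m}]}\le\frac{2|m|}{n}.$$

In other words,

$$\frac{4\rho^2|m|}{n}\le\mathbb E_{f_0}[X_m^2]\le\frac{2|m|}{n}.$$

This ends the proof of (6.8) and (6.9).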
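The two-sided bound on the expectation of the chi-square type statistic can be evaluated exactly on a toy example, since the expectation only involves the variances $\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))$. In the sketch below the probabilities, the partition and the value of `rho` (taken as the distance of the probabilities to 0 and 1) are hypothetical choices, not the paper's simulation design.

```python
import numpy as np

def expected_chi2(pi0, partition):
    """Exact expectation of the chi-square type statistic:
    (1/n) * sum over cells J of sum_{k in J} pi0_k (1 - pi0_k) / (|J| piJ (1 - piJ)),
    where piJ is the mean of pi0 over the cell J."""
    n = len(pi0)
    total = 0.0
    for cell in partition:
        pi_cell = pi0[cell].mean()
        total += np.sum(pi0[cell] * (1.0 - pi0[cell])) / (len(cell) * pi_cell * (1.0 - pi_cell))
    return float(total / n)

rng = np.random.default_rng(2)
n, n_cells = 120, 6
pi0 = rng.uniform(0.1, 0.9, size=n)                # hypothetical true probabilities
partition = np.array_split(np.arange(n), n_cells)  # hypothetical regular partition
rho = float(min(pi0.min(), 1.0 - pi0.max()))       # rho <= pi0 <= 1 - rho

value = expected_chi2(pi0, partition)
lower, upper = 4.0 * rho**2 * n_cells / n, 2.0 * n_cells / n
print(lower, value, upper)
```

The expectation grows linearly with the number of cells of the partition, which is the dimension of the histogram model.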

Proof of (6.10): We start from (6.7), apply Assumption ($A_2$) and Lemma 4.1 to obtain that


and (6.10) is checked since


|E(7C(Pj:\p(/^)]In,(.))l < ^ 


1 ” 

-E- 

n 4—^ L 


log 






-I- 


1 ” 

-E: 

n 


i=l 


log 


(1 - 

(1 - TTf^PXi)) 




(« 


< 21og - P[Q/„(d)]. 
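• Proof of (6.11): We come to the control of $\mathbb P[\Omega^c_m(\delta)]$. Since

$$\Omega^c_m(\delta)=\bigcup_{J\in m}\left\{\left|\frac{\pi^{(J)}_{\hat f_m}}{\pi^{(J)}_{f_m}}-1\right|>\delta\right\}\cup\bigcup_{J\in m}\left\{\left|\frac{1-\pi^{(J)}_{\hat f_m}}{1-\pi^{(J)}_{f_m}}-1\right|>\delta\right\},$$

by applying Lemma 4.1 we infer that

$$\left\{\left|\frac{\pi^{(J)}_{\hat f_m}}{\pi^{(J)}_{f_m}}-1\right|>\delta\right\}=\left\{\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}\pi_{f_0}(x_k)\right\}$$

and

$$\left\{\left|\frac{1-\pi^{(J)}_{\hat f_m}}{1-\pi^{(J)}_{f_m}}-1\right|>\delta\right\}=\left\{\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}(1-\pi_{f_0}(x_k))\right\}.$$

We write

$$\mathbb P\left(\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}\pi_{f_0}(x_k)\right)\le\mathbb P\left(\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\right)$$

and

$$\mathbb P\left(\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}(1-\pi_{f_0}(x_k))\right)\le\mathbb P\left(\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\right).$$

Then we have

$$\mathbb P[\Omega^c_m(\delta)]\le2\sum_{J\in m}\mathbb P\left(\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\right).$$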

Now, we apply Bernstein's concentration inequality (see Massart (2007) for example) to the right-hand side of the previous inequality, starting by recalling this inequality.

Theorem 6.1. Let $Z_1,\dots,Z_n$ be independent real valued random variables. Assume that there exist some positive numbers $v$ and $c$ such that for all $k\ge2$,

$$\sum_{i=1}^n\mathbb E\big[|Z_i|^k\big]\le\frac{k!}{2}\,v\,c^{k-2}.$$

Then for any positive $z$,

$$\mathbb P\left(\sum_{i=1}^n(Z_i-\mathbb E(Z_i))\ge\sqrt{2vz}+cz\right)\le\exp(-z),\quad\text{and}\quad \mathbb P\left(\sum_{i=1}^n(Z_i-\mathbb E(Z_i))\ge z\right)\le\exp\left(-\frac{z^2}{2(v+cz)}\right).$$

Especially, if $|Z_i|\le b$ for all $i$, then

$$(6.15)\qquad \mathbb P\left(\sum_{i=1}^n(Z_i-\mathbb E(Z_i))\ge z\right)\le\exp\left(-\frac{z^2}{2\big(\sum_{i=1}^n\mathbb E(Z_i^2)+bz/3\big)}\right).$$
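To give a feel for the bounded-variable form of Bernstein's inequality, the following sketch compares the bound with a Monte Carlo estimate of the tail probability for sums of centered Bernoulli variables. The sample size, success probability, threshold and number of replications are arbitrary illustrative values.

```python
import numpy as np

def bernstein_bound(z, second_moments, b):
    """Right-hand side of the bounded-variable Bernstein inequality:
    exp(-z^2 / (2 * (sum_i E[Z_i^2] + b * z / 3)))."""
    v = float(np.sum(second_moments))
    return float(np.exp(-z**2 / (2.0 * (v + b * z / 3.0))))

rng = np.random.default_rng(3)
n, p, z, reps = 200, 0.3, 15.0, 20000        # illustrative values
y = rng.binomial(1, p, size=(reps, n))
sums = (y - p).sum(axis=1)                   # sums of centered variables, |Z_i| <= 1
empirical_tail = float(np.mean(sums >= z))   # Monte Carlo estimate of the tail
bound = bernstein_bound(z, np.full(n, p * (1.0 - p)), b=1.0)
print(empirical_tail, bound)
```

The simulated tail frequency is well below the bound, as expected: the inequality holds for every bounded distribution with the same second moments and is therefore not tight for a specific one.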


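Applying (6.15) with $z=\delta\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))$, $b=1$ and $v=\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))$, we get that

$$\mathbb P\left(\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\right)$$

is less than

$$2\exp\left(-\frac{\delta^2\big(\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\big)^2}{2\big(\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))+(\delta/3)\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\big)}\right),$$

and consequently

$$\mathbb P\left(\Big|\sum_{k\in J}\varepsilon_k\Big|>\delta\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\right)\le2\exp\left(-\frac{\delta^2}{2(1+\delta/3)}\sum_{k\in J}\pi_{f_0}(x_k)(1-\pi_{f_0}(x_k))\right)\le2\exp\left(-\frac{\delta^2}{2(1+\delta/3)}|J|\rho^2\right).$$

Consequently,

$$\mathbb P[\Omega^c_m(\delta)]\le4|m|\exp\big(-\Delta\rho^2\Gamma[\log(n)]^2\big),\quad\text{with}\quad \Delta=\frac{\delta^2}{2(1+\delta/3)},$$

where $\Gamma$ is given by (4.1). For $\epsilon>0$ and $\delta$ such that

$$(6.16)\qquad \frac{\delta^2}{2(1+\delta/3)}\rho^2\Gamma\log(n)\ge2+\epsilon,$$

using that $|m|\le n$ implies that

$$4|m|\exp\left(-\frac{\delta^2}{2(1+\delta/3)}\rho^2\Gamma[\log(n)]^2\right)\le\frac{\kappa}{n^{(1+\epsilon)}}.$$

And Result (6.11) follows.

6.5. Proof of Theorem 4.1.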


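By definition, for all $m\in\mathcal M$,

$$\gamma_n(\hat f_{\hat m})+\mathrm{pen}(\hat m)\le\gamma_n(\hat f_m)+\mathrm{pen}(m)\le\gamma_n(f_m)+\mathrm{pen}(m).$$

Applying Formula (6.1), we have

$$(6.17)\qquad \gamma(\hat f_{\hat m})-\gamma(f_0)\le\gamma(f_m)-\gamma(f_0)+\langle\varepsilon,\hat f_{\hat m}-f_m\rangle_n+\mathrm{pen}(m)-\mathrm{pen}(\hat m).$$

Following Baraud (2000b) or Castellan (2003b), instead of bounding the supremum of the empirical process $\langle\varepsilon,\hat f_{\hat m}-f_m\rangle_n$, we split it in three terms. Let

$$\bar\gamma_n(t)=\gamma_n(t)-\mathbb E_{f_0}(\gamma_n(t))=-\langle\varepsilon,t\rangle_n,$$

with $\langle\varepsilon,f\rangle_n$ defined in (6.1), and write

$$\gamma(\hat f_{\hat m})-\gamma(f_0)\le\gamma(f_m)-\gamma(f_0)+\mathrm{pen}(m)-\mathrm{pen}(\hat m)+\bar\gamma_n(f_m)-\bar\gamma_n(f_0)+\bar\gamma_n(f_0)-\bar\gamma_n(f_{\hat m})+\bar\gamma_n(f_{\hat m})-\bar\gamma_n(\hat f_{\hat m}).$$

In other words,

$$(6.18)\qquad \mathcal K(P^{(n)}_{f_0},P^{(n)}_{\hat f_{\hat m}})\le\mathcal K(P^{(n)}_{f_0},P^{(n)}_{f_m})+\mathrm{pen}(m)-\mathrm{pen}(\hat m)+\bar\gamma_n(f_m)-\bar\gamma_n(f_0)+\bar\gamma_n(f_0)-\bar\gamma_n(f_{\hat m})+\bar\gamma_n(f_{\hat m})-\bar\gamma_n(\hat f_{\hat m}).$$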

The proof of Theorem 4.1 can be decomposed in three steps:

(R-1) We prove that for e > 0, 

%o[(7„(/m) -7„(/o))In„.^,(<5)] < 

(R-2) Let be the event 

2 16^ 8 (5 

^i(^) = ^ ~\^'\ + —(l + o) ^{Lm'\in'\ + + -^1 + -j(L„/|m'| + ^) L 

m'eM 


jO) TQlO)', 


where {Lm’)m'eM satisfies Condition (4.2) and nif is given by Definition 4.1 For all 
m' in A\ we prove that on Di(^) 

{7n(fm') - 7n(fm'))^il„,^(6) ^ J 3)(^^ ''' V^)] 

+ 5)(’ ^ S) ^ 

and 

(6.20) P(ni(^)") < 2Se-^. 

(R-3) Let D 2 (^) be the event 


(7„(/o) - 7„(/.')) < ‘^(PS;^p£) - 2 /i 2 (pW,pW) + + 0 


^2(^) = n 

jn' eAi 

We prove that, P(D 2 (^)'^) < 

Now, we will prove the result of Theorem 4.1 using (R-1), (R-2) and (R-3).

According to (6.18), we can write

7C(P^;;\ Fp p^”^) + pen(m) - pen(m) 

+(7n(/m) “ 7n(/o))In„,^(<5) + (7«(/o) “ 7«(/m))lln„^(<5) + (7«(/m) “ 7n(fm)^Cl,„^{S)- 
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Combining (R-[^ and (R-|^ with m' = m, we infer that on H 

'KiFf, F^) In,,,® < + pen(m) - pen(m) + (r„(/,J - 7 „(/o)) In„„^ 


■/o’ fm' -r--v ^ jr--v . 


m 


[i + l 


(1 + -) 

(i + hi 

1-2 ' 


3’ 

( s/i 




This implies that 


7C(Pj,P|^)]In,„^(,) < 7C(P}7,P^;) + pen(m) - pen(m) + - r„(/o))]In,„^w 


j(n) „(«) 




(1+5)' 


-I- 


-I- 


[7f(P«, P«) - 2A\P«, P«)) + W“, pp] 

(1+5)2 


Sinee 


we infer 


- «> - «) V (^)h 0« with C(« := 


7C(Pj;\p|>)]In„,^(,) < *7C(Pf ,P^;;;) + pen(m) - pen(m) + - r„(/o))]In„,^(,) 


+ ^C(d)[l + 6Lrn + 8 ^f^\ + -f [2 + (yZ^)(^ + l)(^ + 5)1 

+['^<+/:’) - za/pJ.pS,’) + 


Using Pythagore’s type identity *7C(P/Q,Py^) = *7C(P^”\ P^”^) + *7C(Py"'’,Pj'’) (see Equation (7.42) 


jW ToW'i 


in Massart (20071) we have 


7C(Pj;\p|)]In,„^(,) < *7C(Pf ,P^”)) + pen(m) - pen(m) + (y^ - r„(/o))]In,„^(,) 

\tn\ [ I - 1 44^rl /I -|-d\/ S\/ 4\n 

- 2A=(P';>,P«’) - .^^(P»,p®)]fc.^„, 


Now, we sueeessively use 
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(i) the relation between Kullback-Leibler information and the Hellinger distance 

fm 


2h^(P*;\P^?) (see Lemma 1.23 in Massart (120071)), 

fm 


(ii) and inequality < 2[h\Ff,Ff) + 

70 70 Jm Jm 


Consequently, on Qi(^) f) 

< 7C(P$;\p|j) + pen(m) - pen(m) + - r7/o))In„^0) 

+ ^C(d)[l + 6L,^ + 8 + ^[^ + + ^)(l + ^)1- 

Since pen(m) > /i|m|[l + bL* + 8 VT^j/n, by taking jx = C{6) yields that on Qi(^) H 

2^1/3 ^ 

+ P®"*”* + T"G") “r,(/o))ln.,w) + ^c,(/,). 

Then, using that 

P(f2i(^)'^UQ2(^)0<3Se-7 

we deduce that P(f2i(^) n f22(^)) > 1 - 32e“^. We now integrating with respect to and use 
(R-[T]) to write that 


%0 [^^(P/o ’ P/*) 

Furthermore, since 


€ 


2fx 




1/3 


1/3 

—(7C(P^;;\pJ;_^) + pen(m)) 


+ 


Ki(p,ix,r,e) , C2(jU,I) 


,( 1 + 6 ) 


+ 


, JT ) < 1, by applying Inequality (6.11) we have, 

K2(p,M,T,e) 


%[^^(P/o’P^)In® (5)] < 


,(l+e) 


Hence we conclude that 


+ 


2„1/3 

and minimizing over $\mathcal M$ leads to the result of Theorem 4.1.

We now come to the proofs of (R-1), (R-2) and (R-3).

• Proof of (R-1)

We know that 

%o[(r«(/m)-r„(/o))In„^(5)]| = |E/o[(r„(/m)-7„(/o))In;;,^(5) 


/C3(p,//,r, 6 ) C2(jU,S) 


„(l+G 


+ 




1 - I' 
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We conclude the proof of (R-[^ by using Inequality (6.111, which implies that 

ll/oi f M K{p,6J,6) _ K'{p,6,Y,e) 
®'/o[(T«(/«) 7 n(/o))In„^(( 5 )J| - 21og „( 1 + 6 ) ^jU+e) 


• Proof of (R-[^ 

We start by the proof of ( |6.19| ) 






= “Z(E4 

Jem' iej 

By Cauchy-Schwarz inequality, we have 



'izf 

)[ ^ 

'pi-2 


iJ) 




aeal 


1 - 

fm 


7nifm') - 7„i.fm') < 




1 - nf 

E i^'K: (^)+0 - ''2) log’- (^)] 

Jem' ^ 


X 


1 r(Ziey^i) t 

iV^! 


and in other words 


7n(fm') - 7n(fm') < ^JX’ >< 


where Xl^, and are defined respectively in ([^l and (6.13) . Using both that in¬ 

equality 2xy < 6x^ + 6~^y\ for all x > 0, y > 0 with 9 = (I + 5)/{l - S), and Inequality ( 6.12 ), 
we obtain on Q.,nJ6) that, 

y.(U) - r„(/,.-)) < )■ 

Consequently, on Qi(^) 

(r„(/m') - r„(/M'))]In,„^(« < + ^)\m'\ + 8(l -P ^{L„W\ + 

+ ‘7C(P^;\P^?)]In (5). 
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Using inequalities lx + and 2xy < Bo? + B with B = 6!A, we infer that 

(|6.19|) follows sinee 


7n(/m') - r„(/m'))In„,/o < + ^(16 + 8L,„/|m'| + 2d|m'|) 

+8f(l + j)(l + j)] + U_^(P£.P“)Ia,,« 

~ 1-^ + 16 VL/)] 


n ^ 1 — 0 '^^ 3'^ o' 1+0 Jm' 


AS)- 


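• Proof of (6.20):

Write $X^2_{m'}=\sum_{J\in m'}(Z_{1,J}+Z_{2,J})$, where

$$Z_{1,J}=\frac1n\frac{\big(\sum_{k\in J}\varepsilon_k\big)^2}{|J|\,\pi^{(J)}_{f_{m'}}}\quad\text{and}\quad Z_{2,J}=\frac1n\frac{\big(\sum_{k\in J}\varepsilon_k\big)^2}{|J|\,(1-\pi^{(J)}_{f_{m'}})}.$$

We will control $\sum_{J\in m'}Z_{1,J}$ and $\sum_{J\in m'}Z_{2,J}$ separately. In order to use Bernstein inequality (see Theorem 6.1), we need an upper bound of $\sum_{J\in m'}\mathbb E\big[Z^p_{1,J}\mathbb 1_{\Omega_{m_f}(\delta)}\big]$, for every $p\ge2$. By definition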


1 

E[Zf,]In„^(5)] = -- - 2px^P-^¥[[\Ysk\>x]c^a,n?5))dx. 

(n|7|;r)jj do 

For every m' eonstrueted on the grid nif, for all J 6 m', on 0.mf{S) n |x < | Yikej we have 

T < I ^ ed < ^ ^ T^fy{xi). 

kej iej 

Combining the previous inequality, the Bernstein inequality ( |6.15 1 with the faet that < 1, we 
infer that 


E[Zf y]If2,„ (5)] 


< 


< 


< 


1 


(nI,keJ^fo(^k)) 

1 

(nZkeJ^fo(Xk)f 

1 

(nZkeJ^fo(^k)y ^0 


X OhkeJ^fyW 

2px^P-^¥{\Y^Sk\>x)d. 

keJ 

Ij ieJ ^/o ) 

4px^^“^ exp ( 

0 

f 


AZieJ^foM 


2(f + ZkeJ^foM) 




Apo?^ ^ exp( - 


2(1 + f)Zte7^/o(-^it) 


^0 


1 s 

< -2P-\\ + -)Pp f/’-i expe¬ 
nd 3 Jo 

< 72''*‘M1 + frcp!). 

nP 3 


■t)dt 
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Consequently 

+ \rw X \m'l 

Jem' 

Now, sinee p < we have 


Jem' 





(5 1/^-2 


Using Bernstein inequality and that e[ Yijem' ^ij)\ ^ W\ln, we have that for every positive x 

/ V^'\ 8 5 f ——- 4 d \ 

P( > Zij\ii„,Ad) > —^ + -(1 + t) ^x\m'\ + -(1 -r -)a:) < exp(-A). 

^ f n n 3 n 3 ' 

Jem’ 

In the same way we prove that 

P( Yj ^2,/2n„,^(<5) > — + -(1 + |) ylx\m'\ + -(1 + |)v) < exp(-jc). 

JL /T JL ^ 

Jem' 

Henee 

/ ,7 2|m'| 16 6 /——- 8 5 \ 

Pfclln,,,,^) >-^ + —(1 + -)^]x\m'\ + -(1 -r -)v) < 2exp(-A), 

' j n n 3 n 3 ' 

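and we conclude that $\mathbb P(\Omega_1(\xi)^c)\le2\sum_{m'}\exp(-L_{m'}|m'|-\xi)=2\Sigma e^{-\xi}$. This ends the proof of (R-2).
• Proof of (R-3)

Recall that $\bar\gamma_n(f)=\gamma_n(f)-\mathbb E(\gamma_n(f))$ for every $f$. According to Markov inequality, for $b>0$,

$$\mathbb P\big(\bar\gamma_n(f_0)-\bar\gamma_n(g)\ge b\big)=\mathbb P\Big(\exp\Big(\frac n2(\bar\gamma_n(f_0)-\bar\gamma_n(g))\Big)\ge\exp\Big(\frac{nb}2\Big)\Big)$$

$$\le\exp\Big(-\frac{nb}2\Big)\mathbb E\Big[\exp\Big(\frac n2(\bar\gamma_n(f_0)-\bar\gamma_n(g))\Big)\Big]$$

$$=\exp\Big[-\frac{nb}2+\log\mathbb E\Big[\exp\Big(\frac n2(\gamma_n(f_0)-\gamma_n(g))+\frac n2\mathbb E[\gamma_n(g)-\gamma_n(f_0)]\Big)\Big]\Big]$$

$$=\exp\Big[-\frac{nb}2+\frac n2\mathcal K(P^{(n)}_{f_0},P^{(n)}_g)+\log\mathbb E\Big[\exp\Big(\frac n2(\gamma_n(f_0)-\gamma_n(g))\Big)\Big]\Big].$$

Now,

$$\log\mathbb E\Big[\exp\Big(\frac n2(\gamma_n(f_0)-\gamma_n(g))\Big)\Big]=\log\mathbb E\Big[\exp\Big(\frac12\sum_{i=1}^n\Big[Y_i\log\frac{\pi_g(x_i)}{\pi_{f_0}(x_i)}+(1-Y_i)\log\frac{1-\pi_g(x_i)}{1-\pi_{f_0}(x_i)}\Big]\Big)\Big]$$

$$=\log\prod_{i=1}^n\left\{\sqrt{\frac{\pi_g(x_i)}{\pi_{f_0}(x_i)}}\,\pi_{f_0}(x_i)+\sqrt{\frac{1-\pi_g(x_i)}{1-\pi_{f_0}(x_i)}}\,(1-\pi_{f_0}(x_i))\right\}$$

$$=\sum_{i=1}^n\log\left\{\sqrt{\pi_g(x_i)\pi_{f_0}(x_i)}+\sqrt{(1-\pi_g(x_i))(1-\pi_{f_0}(x_i))}\right\}.$$

In other words we have

$$\log\mathbb E\Big[\exp\Big(\frac n2(\gamma_n(f_0)-\gamma_n(g))\Big)\Big]=\sum_{i=1}^n\log\Big(1-\frac12\Big[\big(\sqrt{\pi_{f_0}(x_i)}-\sqrt{\pi_g(x_i)}\big)^2+\big(\sqrt{1-\pi_{f_0}(x_i)}-\sqrt{1-\pi_g(x_i)}\big)^2\Big]\Big).$$

This implies that

$$\log\mathbb E\Big[\exp\Big(\frac n2(\gamma_n(f_0)-\gamma_n(g))\Big)\Big]\le-\sum_{i=1}^n\frac12\Big[\big(\sqrt{\pi_{f_0}(x_i)}-\sqrt{\pi_g(x_i)}\big)^2+\big(\sqrt{1-\pi_{f_0}(x_i)}-\sqrt{1-\pi_g(x_i)}\big)^2\Big].$$

Consequently

$$\mathbb P\big(\bar\gamma_n(f_0)-\bar\gamma_n(g)\ge b\big)\le\exp\Big[-\frac{nb}2+\frac n2\mathcal K(P^{(n)}_{f_0},P^{(n)}_g)-n\,h^2(P^{(n)}_{f_0},P^{(n)}_g)\Big],$$

and, if we choose for positive $x$,

$$b=\frac{2x}n+\mathcal K(P^{(n)}_{f_0},P^{(n)}_g)-2h^2(P^{(n)}_{f_0},P^{(n)}_g)>0,$$

we have

$$\mathbb P\Big(\bar\gamma_n(f_0)-\bar\gamma_n(g)\ge\frac{2x}n+\mathcal K(P^{(n)}_{f_0},P^{(n)}_g)-2h^2(P^{(n)}_{f_0},P^{(n)}_g)\Big)\le\exp(-x).$$

We conclude that $\mathbb P(\Omega_2(\xi)^c)\le\sum_{m'}\exp(-L_{m'}|m'|-\xi)\le\Sigma e^{-\xi}$, which ends the proof of (R-3). □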
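The key computation in the proof of (R-3) is the closed form of the half-order moment of the likelihood ratio, together with the inequality $\log(1-u)\le-u$. The sketch below checks both numerically; the probability vectors `pi0` and `pig` are hypothetical, and `hellinger_terms` returns the pointwise squared Hellinger-type terms appearing in the display above.

```python
import numpy as np

def log_half_moment(pi0, pig):
    """log E[exp(0.5 * log-likelihood ratio of g against f0)], computed by taking
    the expectation over each Bernoulli(pi0_i) response directly."""
    terms = pi0 * np.sqrt(pig / pi0) + (1.0 - pi0) * np.sqrt((1.0 - pig) / (1.0 - pi0))
    return float(np.sum(np.log(terms)))

def hellinger_terms(pi0, pig):
    """Pointwise terms 0.5 * [(sqrt(pi0) - sqrt(pig))^2 + (sqrt(1-pi0) - sqrt(1-pig))^2]."""
    return 0.5 * ((np.sqrt(pi0) - np.sqrt(pig))**2
                  + (np.sqrt(1.0 - pi0) - np.sqrt(1.0 - pig))**2)

rng = np.random.default_rng(4)
pi0 = rng.uniform(0.05, 0.95, size=30)   # hypothetical probabilities under f0
pig = rng.uniform(0.05, 0.95, size=30)   # hypothetical probabilities under g

h_terms = hellinger_terms(pi0, pig)
lhs = log_half_moment(pi0, pig)
closed_form = float(np.sum(np.log1p(-h_terms)))   # sum_i log(1 - h_i)
upper = -float(np.sum(h_terms))                   # bound from log(1 - u) <= -u
print(lhs, closed_form, upper)
```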

7. Appendix 

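7.1. Proof of Lemma 4.1. By definition

$$f_m=\arg\min_{f\in S_m}\sum_{i=1}^n\big[\log(1+\exp(f(x_i)))-\pi_{f_0}(x_i)f(x_i)\big].$$

For all $f\in S_m$, for all $J\in m$ and for all $x\in J$, we have $f(x)=f^{(J)}$. Hence $f_m(x)=f_m^{(J)}$ for all $x$ in $J$, and for all $J$ in $m$, we aim at finding $f_m^{(J)}$ such that

$$f_m^{(J)}=\arg\min_{f^{(J)}}\Big[|J|\log(1+\exp(f^{(J)}))-\sum_{i\in J}\pi_{f_0}(x_i)f^{(J)}\Big],$$

where $|J|=\mathrm{card}\{i\in\{1,\dots,n\},\ x_i\in J\}$. Easy calculations show that the coefficient $f_m^{(J)}$ satisfies

$$|J|\frac{\exp(f_m^{(J)})}{1+\exp(f_m^{(J)})}-\sum_{i\in J}\pi_{f_0}(x_i)=0,$$

that is

$$(7.1)\qquad f_m^{(J)}=\log\left(\frac{\sum_{i\in J}\pi_{f_0}(x_i)/|J|}{1-\sum_{i\in J}\pi_{f_0}(x_i)/|J|}\right).$$

Consequently, $\pi_{f_m}$ defined as in (2.2) satisfies $\pi_{f_m}(x)=\pi^{(J)}_{f_m}$ for all $x\in J$, where

$$\pi^{(J)}_{f_m}=\frac1{|J|}\sum_{i\in J}\pi_{f_0}(x_i),$$

and hence $\pi_{f_m}=\arg\min_{t\in S_m}\|t-\pi_{f_0}\|_n$ is the usual projection of $\pi_{f_0}$ onto $S_m=\langle\mathbb 1_J,\ J\in m\rangle$.

In the same way, $\hat f_m$ satisfies $\hat f_m(t)=\hat f_m^{(J)}$ for all $t\in J$, where

$$\hat f_m^{(J)}=\log\left(\frac{\sum_{i\in J}Y_i}{|J|\big(1-\sum_{i\in J}Y_i/|J|\big)}\right).$$

In other words, $\pi_{\hat f_m}$, defined as $\pi_f$ with $f$ replaced by $\hat f_m$, satisfies $\pi_{\hat f_m}(x)=\pi^{(J)}_{\hat f_m}$ for all $x\in J$, with

$$\pi^{(J)}_{\hat f_m}=\frac1{|J|}\sum_{i\in J}Y_i.$$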
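The closed form of the histogram estimator on a single cell can be checked against the criterion it is supposed to minimize. The sketch below uses a hypothetical cell of eight binary responses; it assumes the cell contains both zeros and ones, since otherwise the logit of the cell mean is infinite.

```python
import numpy as np

def cell_objective(f, y_cell):
    """Contribution of one cell J to the criterion: |J| log(1 + e^f) - f * sum_{i in J} Y_i."""
    return float(len(y_cell) * np.logaddexp(0.0, f) - f * np.sum(y_cell))

def cell_logit(y_cell):
    """Closed-form minimizer on the cell: the logit of the cell mean of the responses."""
    ybar = float(np.mean(y_cell))
    return float(np.log(ybar / (1.0 - ybar)))

# Hypothetical responses observed in one cell J (cell mean 5/8).
y_cell = np.array([1, 0, 0, 1, 1, 0, 1, 1], dtype=float)
f_hat = cell_logit(y_cell)
pi_hat = 1.0 / (1.0 + np.exp(-f_hat))

# Compare the objective at f_hat with its values on a grid around f_hat.
grid = f_hat + np.linspace(-1.0, 1.0, 201)
best_on_grid = min(cell_objective(g, y_cell) for g in grid)
print(f_hat, pi_hat, cell_objective(f_hat, y_cell), best_on_grid)
```

The fitted probability on the cell is simply the empirical frequency of ones, which is the form used throughout the proof of Proposition 4.1.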


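7.2. Proof of Lemma 6.2. In the following, for the sake of notational simplicity, we will use $\gamma(\beta)$ for $\gamma(f_\beta)$ and write $D=D_m$. A second-order Taylor expansion of the function $\gamma(\cdot)$ around $\beta^*$ gives, for any $\beta\in\Lambda_m$,

$$\gamma(\beta)=\gamma(\beta^*)+\nabla_\beta\gamma(\beta^*)(\beta-\beta^*)+\int_0^1(1-t)\sum_{i_1+\dots+i_D=2}\frac{2!}{i_1!\cdots i_D!}(\beta_1-\beta_1^*)^{i_1}\cdots(\beta_D-\beta_D^*)^{i_D}\frac{\partial^2\gamma}{\partial\beta_1^{i_1}\cdots\partial\beta_D^{i_D}}(\beta^*+t(\beta-\beta^*))\,dt.$$

Easy calculation shows that

$$\sum_{i_1+\dots+i_D=2}\frac{2!}{i_1!\cdots i_D!}(\beta_1-\beta_1^*)^{i_1}\cdots(\beta_D-\beta_D^*)^{i_D}\frac{\partial^2\gamma}{\partial\beta_1^{i_1}\cdots\partial\beta_D^{i_D}}(\beta^*+t(\beta-\beta^*))$$

$$=\frac1n\sum_{i=1}^n\big(f_\beta(x_i)-f_{\beta^*}(x_i)\big)^2\,\pi_{f_{\beta^*+t(\beta-\beta^*)}}(x_i)\big[1-\pi_{f_{\beta^*+t(\beta-\beta^*)}}(x_i)\big]\ge U_0^2\,\|f_\beta-f_{\beta^*}\|_n^2.$$

This implies that

$$\gamma(\beta)\ge\gamma(\beta^*)+\nabla_\beta\gamma(\beta^*)(\beta-\beta^*)+\frac{U_0^2}{2}\|f_\beta-f_{\beta^*}\|_n^2.$$

Since $\beta^*$ is the minimizer of $\gamma(\cdot)$ over the set $\Lambda_m$, we have $\nabla_\beta\gamma(\beta^*)(\beta-\beta^*)\ge0$ for all $\beta\in\Lambda_m$. Thus the result follows.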

7.3. Proof of Lemma [O} Let Sd and Sd' two veetor spaees of dimension D and D' respee- 
tively. Set 5 = 5^) n Lc»(Co) + Sd' C\ Loo(Co) and s' be an independent eopie of s. Set 


(7.2) Z = sup and for all z = 1,..., n, = sup ^ 

u^S II ^ \\n ueS II ^ Im 


1 / 

- > SkU{Xk) + SiU{Xi) 
n 


k+i 


By Cauehy-Sehwarz Inequality the supremum in (7.2) is aehieved at n 5 (^. Consequently, 


Z - 7}^ < 
with 

%o 


(g; - g;)(n5(^(v:,) 

n II risC^ ||„ 

(e; -eP^[n5(^(T;)]2 


and 


Vo 


[(Z - < 


Vo 


(g; - gpvrisC^Cji;,)]' 


rf- II risC^ 


[n5(g)(T,)]^ 
Iin5(^||2 
[n5(g)(T,)]^ 
II Tlsi^ ||2 


Vo 


[(g,- - g;)^|g] 


(e? +E/.(e?)) 


< 


5[n5(g)(Ty)]^ 

4zz2 \\Us(^ wr 


This implies that 




i=l 


We now apply Lemma 7.1 from Boucheron et al. (2004)), that is reealled here. 
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Lemma 7.1. Let Xi,... ,Xn independent random variables taking values in a measurable space 
X. Denote by X" the vector of these n random variables. Set Z = f{X\,... ,X„) and = 
f{X\,... ,Xi-i,X[,Xi+i,... ,Xn), where X[,... denote independent copies ofXi ,... and 
/; /Y” ^ R some measurable function. Assume that there exists a positive constant c such that, 
E/o [Z”=i(Z - < c. Then for all t > 0, 

FfyiZ > EfyiZ) + 0 < 


Applying Lemma 7.1 to Z defined in (7.2), we obtain that for all a > 0, 


(S, U)n _ 
sup --—— > Eyj, 
ueS II ^ lln 


{e, u)n 


sup 

ueS II tl 



5a 

+ a/ — 


< exp (-x). 


Let ..., (Ad+d'} be an orthonormal basis ofS d + S d'- LFsing Jensen’s Inequality, we write 


Vo 


(e, u)n 

sup -—— 

ueS II ^ lln 


(D+D' ) 

1/2-1 

= E/i,(|| n5(^ ||„) = Efy 

Yj^{sAk)nf 

y k=\ , 



This concludes the proof of Lemma|6.3[ 


(D+D' 


< 






V k=\ 


1/2 


D + D' 


An 


References 

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In Second International Symposium on Information Theory, pp. 267-281. Akademiai
Kiado.

Arlot, S. and P. Massart (2009). Data-driven calibration of penalties for least-squares regression.
The Journal of Machine Learning Research 10, 245-279.

Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of
Statistics 4, 384-414.

Baraud, Y. (2000a). Model selection for regression on a fixed design. Probability Theory and
Related Fields 117(4), 467-493.

Baraud, Y. (2000b). Model selection for regression on a fixed design. Probab. Theory Related
Fields 117(4), 467-493.

Baudry, J.-P., C. Maugis, and B. Michel (2012). Slope heuristics: overview and implementation.
Statistics and Computing 22(2), 455-470.

Birge, L. (2014). Model selection for density estimation with L2-loss. Probab. Theory
Related Fields 158(3-4), 533-574.























Birge, L. and P. Massart (1998). Minimum contrast estimators on sieves: exponential bounds
and rates of convergence. Bernoulli 4(3), 329-375.

Birge, L. and P. Massart (2001). Gaussian model selection. Journal of the European
Mathematical Society 3(3), 203-268.

Birge, L. and P. Massart (2007). Minimal penalties for Gaussian model selection. Probability
Theory and Related Fields 138(1-2), 33-73.

Bontemps, D. and W. Toussile (2013). Clustering and variable selection for categorical
multivariate data. Electronic Journal of Statistics 7, 2344-2371.

Boucheron, S., G. Lugosi, and O. Bousquet (2004). Concentration inequalities. In Machine
Learning Summer School 2003. Advanced Lectures on Machine Learning 3176, 169-240.

Braun, J. V., R. Braun, and H.-G. Muller (2000). Multiple changepoint fitting via
quasilikelihood, with application to DNA sequence segmentation. Biometrika 87(2), 301-314.

Bunea, F. (2008). Honest variable selection in linear and logistic regression models via l1 and
l1+l2 penalization. Electronic Journal of Statistics 2, 1153-1194.

Castellan, G. (2003a). Density estimation via exponential model selection. Information
Theory, IEEE Transactions on 49(8), 2052-2060.

Castellan, G. (2003b). Density estimation via exponential model selection. IEEE Trans.
Inform. Theory 49(8), 2052-2060.

Cox, D. D. and F. O'Sullivan (1990). Asymptotic analysis of penalized likelihood and related
estimators. Ann. Statist. 18(4), 1676-1695.

Durot, C., E. Lebarbier, and A.-S. Tocquet (2009). Estimating the joint distribution of
independent categorical variables via model selection. Bernoulli 15(2), 475-507.

Fan, J., M. Farmen, and I. Gijbels (1998). Local maximum likelihood estimation and
inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 60(3), 591-608.

Farmen, M. W. (1996). The smoothed bootstrap for variable bandwidth selection and some
results in nonparametric logistic regression. ProQuest LLC, Ann Arbor, MI. Thesis
(Ph.D.)-The University of North Carolina at Chapel Hill.

Hastie, T. J. (1983). Nonparametric logistic regression. Appl. Stat.

Kwemou, M. (2012). Non-asymptotic oracle inequalities for the lasso and group lasso in high
dimensional logistic model. Technical report, preprint arXiv:1206.0710.

Lebarbier, E. (2005). Detecting multiple change-points in the mean of Gaussian process by
model selection. Signal Processing 85(4), 717-736.

Lerasle, M. (2012). Optimal model selection in density estimation. Annales de l'Institut
Henri Poincare, Probabilites et Statistiques 48(3), 884-908.

Lu, F. (2006). Regularized nonparametric logistic regression and kernel regularization.
ProQuest LLC, Ann Arbor, MI. Thesis (Ph.D.)-The University of Wisconsin - Madison.

Massart, P. (2007). Concentration inequalities and model selection. Volume 1896 of
Lecture Notes in Mathematics. Berlin: Springer. Lectures from the 33rd Summer School on
Probability Theory held in Saint-Flour, July 6-23, 2003, With a foreword by Jean Picard.





Maugis, C. and B. Michel (2011). Data-driven penalty calibration: a case study for Gaussian
mixture model selection. ESAIM: Probability and Statistics 15, 320-339.

Raghavan, N. (1993). Bayesian inference in nonparametric logistic regression. ProQuest
LLC, Ann Arbor, MI. Thesis (Ph.D.)-University of Illinois at Urbana-Champaign.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2),
461-464.

van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Annals
of Statistics 36(2), 614-645.

Vexler, A. and G. Gurevich (2006). Guaranteed local maximum likelihood detection of a
change point in nonparametric logistic regression. Comm. Statist. Theory Methods
35(4-6), 711-726.

Yang, Y. (1999). Model selection for nonparametric regression. Statistica Sinica 9(2),
475-499.

(1) Laboratoire de Mathematiques et de Modelisation d'Evry, Universite d'Evry Val d'Essonne, UMR CNRS
8071 - USC INRA, 23 Boulevard de France, 91037 Evry

(2) INRA, UR 341 MIA-Jouy, Domaine de Vilvert, F78352 Jouy-en-Josas, France



